linux-mm.kvack.org archive mirror
* [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor
@ 2024-10-22 20:40 Lorenzo Stoakes
  2024-10-22 20:40 ` [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
                   ` (10 more replies)
  0 siblings, 11 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and by which
incomplete state, memory leaks and other unpleasantness can occur.

A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently observed
issues with resource leaks and observable inconsistent state.

This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the VMA
is in an inconsistent state.

The first four patches are intended for backporting, correcting the risk of
users encountering corrupted state when invoking mmap() which otherwise
exists.

After this we go further, refactoring the code, placing it in mm/vma.c in
order to make it eventually userland testable, and significantly
simplifying the logic to avoid this issue arising in future.

Lorenzo Stoakes (8):
  mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  mm: unconditionally close VMAs on error
  mm: refactor map_deny_write_exec()
  mm: resolve faulty mmap_region() error path behaviour
  tools: testing: add additional vma_internal.h stubs
  mm: isolate mmap internal logic to mm/vma.c
  mm: refactor __mmap_region()
  mm: do not attempt second merge for file-backed VMAs

 include/linux/mman.h             |  21 +-
 mm/internal.h                    |  44 ++++
 mm/mmap.c                        | 262 ++------------------
 mm/mprotect.c                    |   2 +-
 mm/nommu.c                       |   7 +-
 mm/vma.c                         | 403 ++++++++++++++++++++++++++++++-
 mm/vma.h                         | 101 +-------
 mm/vma_internal.h                |   5 +
 tools/testing/vma/vma_internal.h | 106 +++++++-
 9 files changed, 591 insertions(+), 360 deletions(-)

--
2.47.0



* [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
@ 2024-10-22 20:40 ` Lorenzo Stoakes
  2024-10-22 21:14   ` Jann Horn
                     ` (2 more replies)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

After an attempted mmap() fails, we are no longer in a situation where we
can safely interact with VMA hooks. This is currently not enforced, meaning
that we need complicated handling to ensure we do not incorrectly call
these hooks.

We can avoid the whole issue by treating the VMA as suspect the moment that
the file->f_op->mmap() hook reports an error, replacing whatever VMA
operations were installed with a dummy, empty set of VMA operations.

We do so through a new helper function internal to mm - mmap_file() - which
is both more logically named than the existing call_mmap() function and
correctly isolates handling of the vm_ops reassignment to mm.

All the existing invocations of call_mmap() outside of mm are ultimately
nested within the call_mmap() from mm, which we now replace.

It is therefore safe to leave call_mmap() in place as a convenience
function (and to avoid churn). The invokers are:

     ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
    coda_file_operations -> mmap -> coda_file_mmap()
     shm_file_operations -> shm_mmap()
shm_file_operations_huge -> shm_mmap()
            dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
	                    -> i915_gem_dmabuf_mmap()

None of these callers interact with vm_ops or mappings in a problematic way
on error; they simply exit out quickly.
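
To illustrate the class of problem (hypothetical driver code, not taken
from this series): an mmap hook may install vm_ops and only then fail,
leaving hooks in place for a mapping which was never fully set up:

    static int foo_mmap(struct file *file, struct vm_area_struct *vma)
    {
            vma->vm_ops = &foo_vm_ops;  /* hooks installed... */

            if (foo_setup_mapping(vma)) /* ...but setup then fails */
                    return -ENOMEM;

            return 0;
    }

Here foo_vm_ops and foo_setup_mapping() are purely illustrative. The point
is that, after such an error return, mmap_file() swaps vma->vm_ops for
vma_dummy_vm_ops, so none of the driver's hooks can be reached while
cleaning up the failed mapping.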

Reported-by: Jann Horn <jannh@google.com>
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Cc: stable <stable@kernel.org>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/internal.h | 27 +++++++++++++++++++++++++++
 mm/mmap.c     |  6 +++---
 mm/nommu.c    |  4 ++--
 3 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 508f7802dd2b..af032e76dfd4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -108,6 +108,33 @@ static inline void *folio_raw_mapping(const struct folio *folio)
 	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
 }

+/*
+ * This is a file-backed mapping, and is about to be memory mapped - invoke its
+ * mmap hook and safely handle error conditions. On error, VMA hooks will be
+ * mutated.
+ *
+ * @file: File which backs the mapping.
+ * @vma:  VMA which we are mapping.
+ *
+ * Returns: 0 on success, an error code otherwise.
+ */
+static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
+{
+	int err = call_mmap(file, vma);
+
+	if (likely(!err))
+		return 0;
+
+	/*
+	 * OK, we tried to call the file hook for mmap(), but an error
+	 * arose. The mapping is in an inconsistent state and we must not invoke
+	 * any further hooks on it.
+	 */
+	vma->vm_ops = &vma_dummy_vm_ops;
+
+	return err;
+}
+
 #ifdef CONFIG_MMU

 /* Flags for folio_pte_batch(). */
diff --git a/mm/mmap.c b/mm/mmap.c
index 1ba0878bbc30..10f4ccaf491b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1425,7 +1425,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/*
 	 * clear PTEs while the vma is still in the tree so that rmap
 	 * cannot race with the freeing later in the truncate scenario.
-	 * This is also needed for call_mmap(), which is why vm_ops
+	 * This is also needed for mmap_file(), which is why vm_ops
 	 * close function is called.
 	 */
 	vms_clean_up_area(&vms, &mas_detach);
@@ -1450,7 +1450,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

 	if (file) {
 		vma->vm_file = get_file(file);
-		error = call_mmap(file, vma);
+		error = mmap_file(file, vma);
 		if (error)
 			goto unmap_and_free_vma;

@@ -1473,7 +1473,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

 		vma_iter_config(&vmi, addr, end);
 		/*
-		 * If vm_flags changed after call_mmap(), we should try merge
+		 * If vm_flags changed after mmap_file(), we should try merge
 		 * vma again as we may succeed this time.
 		 */
 		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
diff --git a/mm/nommu.c b/mm/nommu.c
index 385b0c15add8..f9ccc02458ec 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -885,7 +885,7 @@ static int do_mmap_shared_file(struct vm_area_struct *vma)
 {
 	int ret;

-	ret = call_mmap(vma->vm_file, vma);
+	ret = mmap_file(vma->vm_file, vma);
 	if (ret == 0) {
 		vma->vm_region->vm_top = vma->vm_region->vm_end;
 		return 0;
@@ -918,7 +918,7 @@ static int do_mmap_private(struct vm_area_struct *vma,
 	 * happy.
 	 */
 	if (capabilities & NOMMU_MAP_DIRECT) {
-		ret = call_mmap(vma->vm_file, vma);
+		ret = mmap_file(vma->vm_file, vma);
 		/* shouldn't return success if we're not sharing */
 		if (WARN_ON_ONCE(!is_nommu_shared_mapping(vma->vm_flags)))
 			ret = -ENOSYS;
--
2.47.0



* [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
  2024-10-22 20:40 ` [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
@ 2024-10-22 20:40 ` Lorenzo Stoakes
  2024-10-22 21:15   ` Jann Horn
                     ` (3 more replies)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
                   ` (8 subsequent siblings)
  10 siblings, 4 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

Incorrect invocation of VMA callbacks when the VMA is no longer in a
consistent state is bug prone and risky to perform.

With regard to the important vm_ops->close() callback, we have gone to
great lengths to try to track whether or not we ought to close VMAs.

Rather than doing so and risking making a mistake somewhere, instead
unconditionally close and reset vma->vm_ops to an empty dummy operations
set with a NULL .close operator.

We introduce a new function to do so - vma_close() - and simplify the
existing vms (struct vma_munmap_struct) logic which tracked whether we
needed to close or not.

This simplifies the logic, avoids incorrect double-calling of the .close()
callback and allows us to update error paths to simply call vma_close()
unconditionally - making VMA closure idempotent.
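
The result (names as per the diff below) is that error paths may invoke
closure without tracking whether ->close() has already run, for instance:

    vma_close(vma); /* invokes ->close() if present, installs dummy vm_ops */
    vma_close(vma); /* harmless - vma_dummy_vm_ops has no .close hook */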

Reported-by: Jann Horn <jannh@google.com>
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Cc: stable <stable@kernel.org>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/internal.h | 17 +++++++++++++++++
 mm/mmap.c     |  5 ++---
 mm/nommu.c    |  3 +--
 mm/vma.c      | 14 +++++---------
 mm/vma.h      |  4 +---
 5 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index af032e76dfd4..3a45cc592fd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -135,6 +135,23 @@ static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
 	return err;
 }

+/*
+ * Unconditionally close the VMA if it has a close hook and prevent hooks from
+ * being invoked after close. VMA hooks are mutated.
+ */
+static inline void vma_close(struct vm_area_struct *vma)
+{
+	if (vma->vm_ops && vma->vm_ops->close) {
+		vma->vm_ops->close(vma);
+
+		/*
+		 * The mapping is in an inconsistent state, and no further hooks
+		 * may be invoked upon it.
+		 */
+		vma->vm_ops = &vma_dummy_vm_ops;
+	}
+}
+
 #ifdef CONFIG_MMU

 /* Flags for folio_pte_batch(). */
diff --git a/mm/mmap.c b/mm/mmap.c
index 10f4ccaf491b..d55c58e99a54 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1576,8 +1576,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	return addr;

 close_and_free_vma:
-	if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
-		vma->vm_ops->close(vma);
+	vma_close(vma);

 	if (file || vma->vm_file) {
 unmap_and_free_vma:
@@ -1937,7 +1936,7 @@ void exit_mmap(struct mm_struct *mm)
 	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
-		remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
+		remove_vma(vma, /* unreachable = */ true);
 		count++;
 		cond_resched();
 		vma = vma_next(&vmi);
diff --git a/mm/nommu.c b/mm/nommu.c
index f9ccc02458ec..635d028d647b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -589,8 +589,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
  */
 static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
 {
-	if (vma->vm_ops && vma->vm_ops->close)
-		vma->vm_ops->close(vma);
+	vma_close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	put_nommu_region(vma->vm_region);
diff --git a/mm/vma.c b/mm/vma.c
index 3c5a80876725..bb7cfa2dc282 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -323,11 +323,10 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
 /*
  * Close a vm structure and free it.
  */
-void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
+void remove_vma(struct vm_area_struct *vma, bool unreachable)
 {
 	might_sleep();
-	if (!closed && vma->vm_ops && vma->vm_ops->close)
-		vma->vm_ops->close(vma);
+	vma_close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
@@ -1115,9 +1114,7 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
 	vms_clear_ptes(vms, mas_detach, true);
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		if (vma->vm_ops && vma->vm_ops->close)
-			vma->vm_ops->close(vma);
-	vms->closed_vm_ops = true;
+		vma_close(vma);
 }

 /*
@@ -1160,7 +1157,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	/* Remove and clean up vmas */
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		remove_vma(vma, /* = */ false, vms->closed_vm_ops);
+		remove_vma(vma, /* unreachable = */ false);

 	vm_unacct_memory(vms->nr_accounted);
 	validate_mm(mm);
@@ -1684,8 +1681,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	return new_vma;

 out_vma_link:
-	if (new_vma->vm_ops && new_vma->vm_ops->close)
-		new_vma->vm_ops->close(new_vma);
+	vma_close(new_vma);

 	if (new_vma->vm_file)
 		fput(new_vma->vm_file);
diff --git a/mm/vma.h b/mm/vma.h
index 55457cb68200..75558b5e9c8c 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -42,7 +42,6 @@ struct vma_munmap_struct {
 	int vma_count;                  /* Number of vmas that will be removed */
 	bool unlock;                    /* Unlock after the munmap */
 	bool clear_ptes;                /* If there are outstanding PTE to be cleared */
-	bool closed_vm_ops;		/* call_mmap() was encountered, so vmas may be closed */
 	/* 1 byte hole */
 	unsigned long nr_pages;         /* Number of pages being removed */
 	unsigned long locked_vm;        /* Number of locked pages */
@@ -198,7 +197,6 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->unmap_start = FIRST_USER_ADDRESS;
 	vms->unmap_end = USER_PGTABLES_CEILING;
 	vms->clear_ptes = false;
-	vms->closed_vm_ops = false;
 }
 #endif

@@ -269,7 +267,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 		  unsigned long start, size_t len, struct list_head *uf,
 		  bool unlock);

-void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
+void remove_vma(struct vm_area_struct *vma, bool unreachable);

 void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct vm_area_struct *next);
--
2.47.0



* [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec()
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
  2024-10-22 20:40 ` [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
  2024-10-22 20:40 ` [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
@ 2024-10-22 20:40 ` Lorenzo Stoakes
  2024-10-22 21:15   ` Jann Horn
                     ` (2 more replies)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

Refactor map_deny_write_exec() so that it no longer unnecessarily requires
a VMA parameter but rather accepts VMA flags parameters, which allows us to
use this function early in mmap_region() in a subsequent commit.

While we're here, we refactor the function to be more readable and add some
additional documentation.
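
To illustrate the refactored semantics, a sketch assuming MMF_HAS_MDWE is
set for the current process (the flag combinations are illustrative only):

    map_deny_write_exec(VM_READ | VM_EXEC,
                        VM_READ | VM_EXEC);               /* false */
    map_deny_write_exec(VM_READ | VM_EXEC,
                        VM_READ | VM_WRITE | VM_EXEC);    /* true: W|X */
    map_deny_write_exec(VM_READ,
                        VM_READ | VM_EXEC);               /* true: newly X */
    map_deny_write_exec(VM_READ,
                        VM_READ | VM_WRITE);              /* false: no X */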

Reported-by: Jann Horn <jannh@google.com>
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Cc: stable <stable@kernel.org>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mman.h | 21 ++++++++++++++++++---
 mm/mmap.c            |  2 +-
 mm/mprotect.c        |  2 +-
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index bcb201ab7a41..d71f83b821ef 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -188,16 +188,31 @@ static inline bool arch_memory_deny_write_exec_supported(void)
  *
  *	d)	mmap(PROT_READ | PROT_EXEC)
  *		mmap(PROT_READ | PROT_EXEC | PROT_BTI)
+ *
+ * This is only applicable if the user has set the Memory-Deny-Write-Execute
+ * (MDWE) protection mask for the current process.
+ *
+ * @old specifies the VMA flags the VMA originally possessed, and @new the ones
+ * we propose to set.
+ *
+ * Return: false if the proposed change is OK, true if it is not and should be denied.
  */
-static inline bool map_deny_write_exec(struct vm_area_struct *vma,  unsigned long vm_flags)
+static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
 {
+	/* If MDWE is disabled, we have nothing to deny. */
 	if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
 		return false;

-	if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
+	/* If the new VMA is not executable, we have nothing to deny. */
+	if (!(new & VM_EXEC))
+		return false;
+
+	/* Under MDWE we absolutely do not accept writably executable... */
+	if (new & VM_WRITE)
 		return true;

-	if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
+	/* ...nor newly executable VMAs. */
+	if (!(old & VM_EXEC))
 		return true;

 	return false;
diff --git a/mm/mmap.c b/mm/mmap.c
index d55c58e99a54..66edf0ebba94 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1508,7 +1508,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vma_set_anonymous(vma);
 	}

-	if (map_deny_write_exec(vma, vma->vm_flags)) {
+	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
 		error = -EACCES;
 		goto close_and_free_vma;
 	}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0c5d6d06107d..6f450af3252e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -810,7 +810,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 			break;
 		}

-		if (map_deny_write_exec(vma, newflags)) {
+		if (map_deny_write_exec(vma->vm_flags, newflags)) {
 			error = -EACCES;
 			break;
 		}
--
2.47.0



* [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
@ 2024-10-22 20:40 ` Lorenzo Stoakes
  2024-10-23 12:58   ` Vlastimil Babka
  2024-10-23 15:02   ` Liam R. Howlett
  2024-10-22 20:40 ` [PATCH hotfix 6.12 5/8] tools: testing: add additional vma_internal.h stubs Lorenzo Stoakes
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise and by which
incomplete state, memory leaks and other unpleasantness can occur.

A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently observed
issues with resource leaks and observable inconsistent state.

Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().

Doing this allows us to perform a number of checks up front before we do
any real work, to unconditionally unwind the writable mapping count as
required, and to perform CONFIG_DEBUG_VM_MAPLE_TREE validation
unconditionally also.

We move a number of things here:

1. We preallocate memory for the iterator before we call the file-backed
   memory hook, allowing us to exit early and avoid having to perform
   complicated and error-prone close/free logic. We carefully free
   iterator state on both success and error paths.

2. The enclosing mmap_region() function handles the mapping_map_writable()
   logic early. Previously, mapping_map_writable() was invoked at the point
   of mapping a newly allocated file-backed VMA, with a matching
   mapping_unmap_writable() on success and error paths.

   We now do this unconditionally if this is a file-backed, shared writable
   mapping. A driver may change the flags to eliminate VM_MAYWRITE;
   however, doing so does not invalidate the seal check we just performed,
   and we in any case always decrement the counter in the wrapper.

   We perform a debug assert to ensure a driver does not attempt to do the
   opposite (see the assertion quoted after this list).

3. We also move arch_validate_flags() up into the mmap_region()
   function. This is only relevant on arm64 and sparc64, and the check is
   only meaningful for SPARC with ADI enabled. We explicitly add a warning
   for this arch if a driver invalidates this check, though the code ought
   eventually to be fixed to eliminate the need for this.
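
The invariant asserted after mmap_file() (quoted from the diff below) is
that a driver may clear VM_MAYWRITE, but must not newly set it when the
caller did not allow it, since that would bypass the seal check already
performed in the wrapper:

    VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
                    !(vm_flags & VM_MAYWRITE) &&
                    (vma->vm_flags & VM_MAYWRITE));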

With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.

This eliminates an entire class of errors and makes the code easier to
reason about and more robust.

Reported-by: Jann Horn <jannh@google.com>
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Cc: stable <stable@kernel.org>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 120 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 66 insertions(+), 54 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 66edf0ebba94..7d02b47a1895 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1361,20 +1361,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
 }

-unsigned long mmap_region(struct file *file, unsigned long addr,
+static unsigned long __mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
 		struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	pgoff_t pglen = PHYS_PFN(len);
-	struct vm_area_struct *merge;
 	unsigned long charged = 0;
 	struct vma_munmap_struct vms;
 	struct ma_state mas_detach;
 	struct maple_tree mt_detach;
 	unsigned long end = addr + len;
-	bool writable_file_mapping = false;
 	int error;
 	VMA_ITERATOR(vmi, mm, addr);
 	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
@@ -1448,28 +1446,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vm_flags_init(vma, vm_flags);
 	vma->vm_page_prot = vm_get_page_prot(vm_flags);

+	if (vma_iter_prealloc(&vmi, vma)) {
+		error = -ENOMEM;
+		goto free_vma;
+	}
+
 	if (file) {
 		vma->vm_file = get_file(file);
 		error = mmap_file(file, vma);
 		if (error)
-			goto unmap_and_free_vma;
-
-		if (vma_is_shared_maywrite(vma)) {
-			error = mapping_map_writable(file->f_mapping);
-			if (error)
-				goto close_and_free_vma;
-
-			writable_file_mapping = true;
-		}
+			goto unmap_and_free_file_vma;

+		/* Drivers cannot alter the address of the VMA. */
+		WARN_ON_ONCE(addr != vma->vm_start);
 		/*
-		 * Expansion is handled above, merging is handled below.
-		 * Drivers should not alter the address of the VMA.
+		 * Drivers should not permit writability when previously it was
+		 * disallowed.
 		 */
-		if (WARN_ON((addr != vma->vm_start))) {
-			error = -EINVAL;
-			goto close_and_free_vma;
-		}
+		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
+				!(vm_flags & VM_MAYWRITE) &&
+				(vma->vm_flags & VM_MAYWRITE));

 		vma_iter_config(&vmi, addr, end);
 		/*
@@ -1477,6 +1473,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 * vma again as we may succeed this time.
 		 */
 		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
+			struct vm_area_struct *merge;
+
 			vmg.flags = vma->vm_flags;
 			/* If this fails, state is reset ready for a reattempt. */
 			merge = vma_merge_new_range(&vmg);
@@ -1491,10 +1489,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 				 */
 				fput(vma->vm_file);
 				vm_area_free(vma);
+				vma_iter_free(&vmi);
 				vma = merge;
 				/* Update vm_flags to pick up the change. */
 				vm_flags = vma->vm_flags;
-				goto unmap_writable;
+				goto file_expanded;
 			}
 			vma_iter_config(&vmi, addr, end);
 		}
@@ -1503,26 +1502,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
 		if (error)
-			goto free_vma;
+			goto free_iter_vma;
 	} else {
 		vma_set_anonymous(vma);
 	}

-	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
-		error = -EACCES;
-		goto close_and_free_vma;
-	}
-
-	/* Allow architectures to sanity-check the vm_flags */
-	if (!arch_validate_flags(vma->vm_flags)) {
-		error = -EINVAL;
-		goto close_and_free_vma;
-	}
-
-	if (vma_iter_prealloc(&vmi, vma)) {
-		error = -ENOMEM;
-		goto close_and_free_vma;
-	}
+#ifdef CONFIG_SPARC64
+	/* TODO: Fix SPARC ADI! */
+	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+#endif

 	/* Lock the VMA since it is modified after insertion into VMA tree */
 	vma_start_write(vma);
@@ -1536,10 +1524,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 */
 	khugepaged_enter_vma(vma, vma->vm_flags);

-	/* Once vma denies write, undo our temporary denial count */
-unmap_writable:
-	if (writable_file_mapping)
-		mapping_unmap_writable(file->f_mapping);
+file_expanded:
 	file = vma->vm_file;
 	ksm_add_vma(vma);
 expanded:
@@ -1572,23 +1557,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

 	vma_set_page_prot(vma);

-	validate_mm(mm);
 	return addr;

-close_and_free_vma:
-	vma_close(vma);
-
-	if (file || vma->vm_file) {
-unmap_and_free_vma:
-		fput(vma->vm_file);
-		vma->vm_file = NULL;
+unmap_and_free_file_vma:
+	fput(vma->vm_file);
+	vma->vm_file = NULL;

-		vma_iter_set(&vmi, vma->vm_end);
-		/* Undo any partial mapping done by a device driver. */
-		unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
-	}
-	if (writable_file_mapping)
-		mapping_unmap_writable(file->f_mapping);
+	vma_iter_set(&vmi, vma->vm_end);
+	/* Undo any partial mapping done by a device driver. */
+	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
+free_iter_vma:
+	vma_iter_free(&vmi);
 free_vma:
 	vm_area_free(vma);
 unacct_error:
@@ -1598,10 +1577,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 abort_munmap:
 	vms_abort_munmap_vmas(&vms, &mas_detach);
 gather_failed:
-	validate_mm(mm);
 	return error;
 }

+unsigned long mmap_region(struct file *file, unsigned long addr,
+			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+			  struct list_head *uf)
+{
+	unsigned long ret;
+	bool writable_file_mapping = false;
+
+	/* Allow architectures to sanity-check the vm_flags. */
+	if (!arch_validate_flags(vm_flags))
+		return -EINVAL;
+
+	/* Check to see if MDWE is applicable. */
+	if (map_deny_write_exec(vm_flags, vm_flags))
+		return -EACCES;
+
+	/* Map writable and ensure this isn't a sealed memfd. */
+	if (file && is_shared_maywrite(vm_flags)) {
+		int error = mapping_map_writable(file->f_mapping);
+
+		if (error)
+			return error;
+		writable_file_mapping = true;
+	}
+
+	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
+
+	/* Clear our write mapping regardless of error. */
+	if (writable_file_mapping)
+		mapping_unmap_writable(file->f_mapping);
+
+	validate_mm(current->mm);
+	return ret;
+}
+
 static int __vm_munmap(unsigned long start, size_t len, bool unlock)
 {
 	int ret;
--
2.47.0



* [PATCH hotfix 6.12 5/8] tools: testing: add additional vma_internal.h stubs
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
@ 2024-10-22 20:40 ` Lorenzo Stoakes
  2024-10-22 20:40 ` [PATCH hotfix 6.12 6/8] mm: isolate mmap internal logic to mm/vma.c Lorenzo Stoakes
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

Add some additional vma_internal.h stubs in preparation for __mmap_region()
being moved to mm/vma.c. Without these the move would result in the tests
no longer compiling.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 tools/testing/vma/vma_internal.h | 106 ++++++++++++++++++++++++++++++-
 1 file changed, 105 insertions(+), 1 deletion(-)

diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index c5b9da034511..bec50a8c846c 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -44,7 +44,9 @@
 #define VM_LOCKED	0x00002000
 #define VM_IO           0x00004000
 #define VM_DONTEXPAND	0x00040000
+#define VM_LOCKONFAULT	0x00080000
 #define VM_ACCOUNT	0x00100000
+#define VM_NORESERVE	0x00200000
 #define VM_MIXEDMAP	0x10000000
 #define VM_STACK	VM_GROWSDOWN
 #define VM_SHADOW_STACK	VM_NONE
@@ -53,6 +55,14 @@
 #define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)
 #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

+/* This mask represents all the VMA flag bits used by mlock */
+#define VM_LOCKED_MASK	(VM_LOCKED | VM_LOCKONFAULT)
+
+#ifdef CONFIG_64BIT
+/* VM is sealed, in vm_flags */
+#define VM_SEALED	_BITUL(63)
+#endif
+
 #define FIRST_USER_ADDRESS	0UL
 #define USER_PGTABLES_CEILING	0UL

@@ -698,8 +708,9 @@ static inline void tlb_finish_mmu(struct mmu_gather *)
 {
 }

-static inline void get_file(struct file *)
+static inline struct file *get_file(struct file *f)
 {
+	return f;
 }

 static inline int vma_dup_policy(struct vm_area_struct *, struct vm_area_struct *)
@@ -920,4 +931,97 @@ static inline bool signal_pending(void *)
 	return false;
 }

+static inline bool is_file_hugepages(struct file *)
+{
+	return false;
+}
+
+static inline int security_vm_enough_memory_mm(struct mm_struct *, long)
+{
+	return 0;
+}
+
+static inline bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long)
+{
+	return true;
+}
+
+static inline void vm_flags_init(struct vm_area_struct *vma,
+				 vm_flags_t flags)
+{
+	vma->__vm_flags = flags;
+}
+
+static inline void vm_flags_set(struct vm_area_struct *vma,
+				vm_flags_t flags)
+{
+	vma_start_write(vma);
+	vma->__vm_flags |= flags;
+}
+
+static inline void vm_flags_clear(struct vm_area_struct *vma,
+				  vm_flags_t flags)
+{
+	vma_start_write(vma);
+	vma->__vm_flags &= ~flags;
+}
+
+static inline int call_mmap(struct file *, struct vm_area_struct *)
+{
+	return 0;
+}
+
+static inline int shmem_zero_setup(struct vm_area_struct *)
+{
+	return 0;
+}
+
+static inline void vma_set_anonymous(struct vm_area_struct *vma)
+{
+	vma->vm_ops = NULL;
+}
+
+static inline void ksm_add_vma(struct vm_area_struct *)
+{
+}
+
+static inline void perf_event_mmap(struct vm_area_struct *)
+{
+}
+
+static inline bool vma_is_dax(struct vm_area_struct *)
+{
+	return false;
+}
+
+static inline struct vm_area_struct *get_gate_vma(struct mm_struct *)
+{
+	return NULL;
+}
+
+bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
+
+/* Update vma->vm_page_prot to reflect vma->vm_flags. */
+static inline void vma_set_page_prot(struct vm_area_struct *vma)
+{
+	unsigned long vm_flags = vma->vm_flags;
+	pgprot_t vm_page_prot;
+
+	/* testing: we inline vm_pgprot_modify() to avoid clash with vma.h. */
+	vm_page_prot = pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vm_flags));
+
+	if (vma_wants_writenotify(vma, vm_page_prot)) {
+		vm_flags &= ~VM_SHARED;
+		/* testing: we inline vm_pgprot_modify() to avoid clash with vma.h. */
+		vm_page_prot = pgprot_modify(vm_page_prot, vm_get_page_prot(vm_flags));
+	}
+	/* remove_protection_ptes reads vma->vm_page_prot without mmap_lock */
+	WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
+}
+
+static inline bool arch_validate_flags(unsigned long)
+{
+	return true;
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */
--
2.47.0



* [PATCH hotfix 6.12 6/8] mm: isolate mmap internal logic to mm/vma.c
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 5/8] tools: testing: add additional vma_internal.h stubs Lorenzo Stoakes
@ 2024-10-22 20:40 ` Lorenzo Stoakes
  2024-10-22 20:40 ` [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

In previous commits we effected improvements to the mmap() logic in
mmap_region() and its newly introduced internal implementation function
__mmap_region().

However, as these changes are intended to be backported, we kept the delta
as small as possible and made as few changes as possible to the newly
introduced mm/vma.* files.

Take the opportunity to move this logic to mm/vma.c, which not only
isolates it but also makes it available for later userland testing, which
can help us catch such logic errors far earlier.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c         | 235 ---------------------------------
 mm/vma.c          | 324 +++++++++++++++++++++++++++++++++++++++++++++-
 mm/vma.h          |  97 +-------------
 mm/vma_internal.h |   5 +
 4 files changed, 330 insertions(+), 331 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 7d02b47a1895..1adf2c1cc3c0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -577,22 +577,6 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
 }
 #endif /* __ARCH_WANT_SYS_OLD_MMAP */

-/*
- * We account for memory if it's a private writeable mapping,
- * not hugepages and VM_NORESERVE wasn't set.
- */
-static inline bool accountable_mapping(struct file *file, vm_flags_t vm_flags)
-{
-	/*
-	 * hugetlb has its own accounting separate from the core VM
-	 * VM_HUGETLB may not be set yet so we cannot check for that flag.
-	 */
-	if (file && is_file_hugepages(file))
-		return false;
-
-	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
-}
-
 /**
  * unmapped_area() - Find an area between the low_limit and the high_limit with
  * the correct alignment and offset, all from @info. Note: current->mm is used
@@ -1361,225 +1345,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
 }

-static unsigned long __mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
-{
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	pgoff_t pglen = PHYS_PFN(len);
-	unsigned long charged = 0;
-	struct vma_munmap_struct vms;
-	struct ma_state mas_detach;
-	struct maple_tree mt_detach;
-	unsigned long end = addr + len;
-	int error;
-	VMA_ITERATOR(vmi, mm, addr);
-	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
-
-	vmg.file = file;
-	/* Find the first overlapping VMA */
-	vma = vma_find(&vmi, end);
-	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
-	if (vma) {
-		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
-		mt_on_stack(mt_detach);
-		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
-		/* Prepare to unmap any existing mapping in the area */
-		error = vms_gather_munmap_vmas(&vms, &mas_detach);
-		if (error)
-			goto gather_failed;
-
-		vmg.next = vms.next;
-		vmg.prev = vms.prev;
-		vma = NULL;
-	} else {
-		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
-	}
-
-	/* Check against address space limit. */
-	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
-		error = -ENOMEM;
-		goto abort_munmap;
-	}
-
-	/*
-	 * Private writable mapping: check memory availability
-	 */
-	if (accountable_mapping(file, vm_flags)) {
-		charged = pglen;
-		charged -= vms.nr_accounted;
-		if (charged) {
-			error = security_vm_enough_memory_mm(mm, charged);
-			if (error)
-				goto abort_munmap;
-		}
-
-		vms.nr_accounted = 0;
-		vm_flags |= VM_ACCOUNT;
-		vmg.flags = vm_flags;
-	}
-
-	/*
-	 * clear PTEs while the vma is still in the tree so that rmap
-	 * cannot race with the freeing later in the truncate scenario.
-	 * This is also needed for mmap_file(), which is why vm_ops
-	 * close function is called.
-	 */
-	vms_clean_up_area(&vms, &mas_detach);
-	vma = vma_merge_new_range(&vmg);
-	if (vma)
-		goto expanded;
-	/*
-	 * Determine the object being mapped and call the appropriate
-	 * specific mapper. the address has already been validated, but
-	 * not unmapped, but the maps are removed from the list.
-	 */
-	vma = vm_area_alloc(mm);
-	if (!vma) {
-		error = -ENOMEM;
-		goto unacct_error;
-	}
-
-	vma_iter_config(&vmi, addr, end);
-	vma_set_range(vma, addr, end, pgoff);
-	vm_flags_init(vma, vm_flags);
-	vma->vm_page_prot = vm_get_page_prot(vm_flags);
-
-	if (vma_iter_prealloc(&vmi, vma)) {
-		error = -ENOMEM;
-		goto free_vma;
-	}
-
-	if (file) {
-		vma->vm_file = get_file(file);
-		error = mmap_file(file, vma);
-		if (error)
-			goto unmap_and_free_file_vma;
-
-		/* Drivers cannot alter the address of the VMA. */
-		WARN_ON_ONCE(addr != vma->vm_start);
-		/*
-		 * Drivers should not permit writability when previously it was
-		 * disallowed.
-		 */
-		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
-				!(vm_flags & VM_MAYWRITE) &&
-				(vma->vm_flags & VM_MAYWRITE));
-
-		vma_iter_config(&vmi, addr, end);
-		/*
-		 * If vm_flags changed after mmap_file(), we should try merge
-		 * vma again as we may succeed this time.
-		 */
-		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
-			struct vm_area_struct *merge;
-
-			vmg.flags = vma->vm_flags;
-			/* If this fails, state is reset ready for a reattempt. */
-			merge = vma_merge_new_range(&vmg);
-
-			if (merge) {
-				/*
-				 * ->mmap() can change vma->vm_file and fput
-				 * the original file. So fput the vma->vm_file
-				 * here or we would add an extra fput for file
-				 * and cause general protection fault
-				 * ultimately.
-				 */
-				fput(vma->vm_file);
-				vm_area_free(vma);
-				vma_iter_free(&vmi);
-				vma = merge;
-				/* Update vm_flags to pick up the change. */
-				vm_flags = vma->vm_flags;
-				goto file_expanded;
-			}
-			vma_iter_config(&vmi, addr, end);
-		}
-
-		vm_flags = vma->vm_flags;
-	} else if (vm_flags & VM_SHARED) {
-		error = shmem_zero_setup(vma);
-		if (error)
-			goto free_iter_vma;
-	} else {
-		vma_set_anonymous(vma);
-	}
-
-#ifdef CONFIG_SPARC64
-	/* TODO: Fix SPARC ADI! */
-	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
-#endif
-
-	/* Lock the VMA since it is modified after insertion into VMA tree */
-	vma_start_write(vma);
-	vma_iter_store(&vmi, vma);
-	mm->map_count++;
-	vma_link_file(vma);
-
-	/*
-	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
-	 * call covers the non-merge case.
-	 */
-	khugepaged_enter_vma(vma, vma->vm_flags);
-
-file_expanded:
-	file = vma->vm_file;
-	ksm_add_vma(vma);
-expanded:
-	perf_event_mmap(vma);
-
-	/* Unmap any existing mapping in the area */
-	vms_complete_munmap_vmas(&vms, &mas_detach);
-
-	vm_stat_account(mm, vm_flags, pglen);
-	if (vm_flags & VM_LOCKED) {
-		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
-					is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm))
-			vm_flags_clear(vma, VM_LOCKED_MASK);
-		else
-			mm->locked_vm += pglen;
-	}
-
-	if (file)
-		uprobe_mmap(vma);
-
-	/*
-	 * New (or expanded) vma always get soft dirty status.
-	 * Otherwise user-space soft-dirty page tracker won't
-	 * be able to distinguish situation when vma area unmapped,
-	 * then new mapped in-place (which must be aimed as
-	 * a completely new data area).
-	 */
-	vm_flags_set(vma, VM_SOFTDIRTY);
-
-	vma_set_page_prot(vma);
-
-	return addr;
-
-unmap_and_free_file_vma:
-	fput(vma->vm_file);
-	vma->vm_file = NULL;
-
-	vma_iter_set(&vmi, vma->vm_end);
-	/* Undo any partial mapping done by a device driver. */
-	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
-free_iter_vma:
-	vma_iter_free(&vmi);
-free_vma:
-	vm_area_free(vma);
-unacct_error:
-	if (charged)
-		vm_unacct_memory(charged);
-
-abort_munmap:
-	vms_abort_munmap_vmas(&vms, &mas_detach);
-gather_failed:
-	return error;
-}
-
 unsigned long mmap_region(struct file *file, unsigned long addr,
 			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
 			  struct list_head *uf)
diff --git a/mm/vma.c b/mm/vma.c
index bb7cfa2dc282..7617f9d50d62 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1103,7 +1103,7 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
 	vms->clear_ptes = false;
 }

-void vms_clean_up_area(struct vma_munmap_struct *vms,
+static void vms_clean_up_area(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *vma;
@@ -1126,7 +1126,7 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
  * used for the munmap() and may downgrade the lock - if requested.  Everything
  * needed to be done once the vma maple tree is updated.
  */
-void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *vma;
@@ -1167,6 +1167,23 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	__mt_destroy(mas_detach->tree);
 }

+/*
+ * reattach_vmas() - Undo any munmap work and free resources
+ * @mas_detach: The maple state with the detached maple tree
+ *
+ * Reattach any detached vmas and free up the maple tree used to track the vmas.
+ */
+static void reattach_vmas(struct ma_state *mas_detach)
+{
+	struct vm_area_struct *vma;
+
+	mas_set(mas_detach, 0);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		vma_mark_detached(vma, false);
+
+	__mt_destroy(mas_detach->tree);
+}
+
 /*
  * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
  * for removal at a later date.  Handles splitting first and last if necessary
@@ -1177,7 +1194,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
  *
  * Return: 0 on success, error otherwise
  */
-int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *next = NULL;
@@ -1315,6 +1332,39 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	return error;
 }

+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation.  Only unlocked on success
+ */
+static void init_vma_munmap(struct vma_munmap_struct *vms,
+		struct vma_iterator *vmi, struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, struct list_head *uf,
+		bool unlock)
+{
+	vms->vmi = vmi;
+	vms->vma = vma;
+	if (vma) {
+		vms->start = start;
+		vms->end = end;
+	} else {
+		vms->start = vms->end = 0;
+	}
+	vms->unlock = unlock;
+	vms->uf = uf;
+	vms->vma_count = 0;
+	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+	vms->unmap_start = FIRST_USER_ADDRESS;
+	vms->unmap_end = USER_PGTABLES_CEILING;
+	vms->clear_ptes = false;
+}
+
 /*
  * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
  * @vmi: The vma iterator
@@ -2069,3 +2119,271 @@ void mm_drop_all_locks(struct mm_struct *mm)

 	mutex_unlock(&mm_all_locks_mutex);
 }
+
+/*
+ * We account for memory if it's a private writeable mapping,
+ * not hugepages and VM_NORESERVE wasn't set.
+ */
+static bool accountable_mapping(struct file *file, vm_flags_t vm_flags)
+{
+	/*
+	 * hugetlb has its own accounting separate from the core VM
+	 * VM_HUGETLB may not be set yet so we cannot check for that flag.
+	 */
+	if (file && is_file_hugepages(file))
+		return false;
+
+	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
+}
+
+/*
+ * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
+ * operation.
+ * @vms: The vma unmap structure
+ * @mas_detach: The maple state with the detached maple tree
+ *
+ * Reattach any detached vmas, free up the maple tree used to track the vmas.
+ * If that's not possible because the ptes are cleared (and vm_ops->close() may
+ * have been called), then a NULL is written over the vmas and the vmas are
+ * removed (munmap() completed).
+ */
+static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
+{
+	struct ma_state *mas = &vms->vmi->mas;
+
+	if (!vms->nr_pages)
+		return;
+
+	if (vms->clear_ptes)
+		return reattach_vmas(mas_detach);
+
+	/*
+	 * Aborting cannot just call the vm_ops open() because they are often
+	 * not symmetrical and state data has been lost.  Resort to the old
+	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
+	 */
+	mas_set_range(mas, vms->start, vms->end - 1);
+	mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
+	/* Clean up the insertion of the unfortunate gap */
+	vms_complete_munmap_vmas(vms, mas_detach);
+}
+
+unsigned long __mmap_region(struct file *file, unsigned long addr,
+		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+		struct list_head *uf)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	pgoff_t pglen = PHYS_PFN(len);
+	unsigned long charged = 0;
+	struct vma_munmap_struct vms;
+	struct ma_state mas_detach;
+	struct maple_tree mt_detach;
+	unsigned long end = addr + len;
+	int error;
+	VMA_ITERATOR(vmi, mm, addr);
+	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
+
+	vmg.file = file;
+	/* Find the first overlapping VMA */
+	vma = vma_find(&vmi, end);
+	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
+	if (vma) {
+		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+		mt_on_stack(mt_detach);
+		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+		/* Prepare to unmap any existing mapping in the area */
+		error = vms_gather_munmap_vmas(&vms, &mas_detach);
+		if (error)
+			goto gather_failed;
+
+		vmg.next = vms.next;
+		vmg.prev = vms.prev;
+		vma = NULL;
+	} else {
+		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
+	}
+
+	/* Check against address space limit. */
+	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
+		error = -ENOMEM;
+		goto abort_munmap;
+	}
+
+	/*
+	 * Private writable mapping: check memory availability
+	 */
+	if (accountable_mapping(file, vm_flags)) {
+		charged = pglen;
+		charged -= vms.nr_accounted;
+		if (charged) {
+			error = security_vm_enough_memory_mm(mm, charged);
+			if (error)
+				goto abort_munmap;
+		}
+
+		vms.nr_accounted = 0;
+		vm_flags |= VM_ACCOUNT;
+		vmg.flags = vm_flags;
+	}
+
+	/*
+	 * clear PTEs while the vma is still in the tree so that rmap
+	 * cannot race with the freeing later in the truncate scenario.
+	 * This is also needed for mmap_file(), which is why vm_ops
+	 * close function is called.
+	 */
+	vms_clean_up_area(&vms, &mas_detach);
+	vma = vma_merge_new_range(&vmg);
+	if (vma)
+		goto expanded;
+	/*
+	 * Determine the object being mapped and call the appropriate
+	 * specific mapper. the address has already been validated, but
+	 * not unmapped, but the maps are removed from the list.
+	 */
+	vma = vm_area_alloc(mm);
+	if (!vma) {
+		error = -ENOMEM;
+		goto unacct_error;
+	}
+
+	vma_iter_config(&vmi, addr, end);
+	vma_set_range(vma, addr, end, pgoff);
+	vm_flags_init(vma, vm_flags);
+	vma->vm_page_prot = vm_get_page_prot(vm_flags);
+
+	if (vma_iter_prealloc(&vmi, vma)) {
+		error = -ENOMEM;
+		goto free_vma;
+	}
+
+	if (file) {
+		vma->vm_file = get_file(file);
+		error = mmap_file(file, vma);
+		if (error)
+			goto unmap_and_free_file_vma;
+
+		/* Drivers cannot alter the address of the VMA. */
+		WARN_ON_ONCE(addr != vma->vm_start);
+		/*
+		 * Drivers should not permit writability when previously it was
+		 * disallowed.
+		 */
+		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
+				!(vm_flags & VM_MAYWRITE) &&
+				(vma->vm_flags & VM_MAYWRITE));
+
+		vma_iter_config(&vmi, addr, end);
+		/*
+		 * If vm_flags changed after mmap_file(), we should try merge
+		 * vma again as we may succeed this time.
+		 */
+		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
+			struct vm_area_struct *merge;
+
+			vmg.flags = vma->vm_flags;
+			/* If this fails, state is reset ready for a reattempt. */
+			merge = vma_merge_new_range(&vmg);
+
+			if (merge) {
+				/*
+				 * ->mmap() can change vma->vm_file and fput
+				 * the original file. So fput the vma->vm_file
+				 * here or we would add an extra fput for file
+				 * and cause general protection fault
+				 * ultimately.
+				 */
+				fput(vma->vm_file);
+				vm_area_free(vma);
+				vma_iter_free(&vmi);
+				vma = merge;
+				/* Update vm_flags to pick up the change. */
+				vm_flags = vma->vm_flags;
+				goto file_expanded;
+			}
+			vma_iter_config(&vmi, addr, end);
+		}
+
+		vm_flags = vma->vm_flags;
+	} else if (vm_flags & VM_SHARED) {
+		error = shmem_zero_setup(vma);
+		if (error)
+			goto free_iter_vma;
+	} else {
+		vma_set_anonymous(vma);
+	}
+
+#ifdef CONFIG_SPARC64
+	/* TODO: Fix SPARC ADI! */
+	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+#endif
+
+	/* Lock the VMA since it is modified after insertion into VMA tree */
+	vma_start_write(vma);
+	vma_iter_store(&vmi, vma);
+	mm->map_count++;
+	vma_link_file(vma);
+
+	/*
+	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
+	 * call covers the non-merge case.
+	 */
+	khugepaged_enter_vma(vma, vma->vm_flags);
+
+file_expanded:
+	file = vma->vm_file;
+	ksm_add_vma(vma);
+expanded:
+	perf_event_mmap(vma);
+
+	/* Unmap any existing mapping in the area */
+	vms_complete_munmap_vmas(&vms, &mas_detach);
+
+	vm_stat_account(mm, vm_flags, pglen);
+	if (vm_flags & VM_LOCKED) {
+		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
+					is_vm_hugetlb_page(vma) ||
+					vma == get_gate_vma(current->mm))
+			vm_flags_clear(vma, VM_LOCKED_MASK);
+		else
+			mm->locked_vm += pglen;
+	}
+
+	if (file)
+		uprobe_mmap(vma);
+
+	/*
+	 * New (or expanded) vma always get soft dirty status.
+	 * Otherwise user-space soft-dirty page tracker won't
+	 * be able to distinguish situation when vma area unmapped,
+	 * then new mapped in-place (which must be aimed as
+	 * a completely new data area).
+	 */
+	vm_flags_set(vma, VM_SOFTDIRTY);
+
+	vma_set_page_prot(vma);
+
+	return addr;
+
+unmap_and_free_file_vma:
+	fput(vma->vm_file);
+	vma->vm_file = NULL;
+
+	vma_iter_set(&vmi, vma->vm_end);
+	/* Undo any partial mapping done by a device driver. */
+	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
+free_iter_vma:
+	vma_iter_free(&vmi);
+free_vma:
+	vm_area_free(vma);
+unacct_error:
+	if (charged)
+		vm_unacct_memory(charged);
+
+abort_munmap:
+	vms_abort_munmap_vmas(&vms, &mas_detach);
+gather_failed:
+	return error;
+}
diff --git a/mm/vma.h b/mm/vma.h
index 75558b5e9c8c..eb15616e4e2c 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -165,99 +165,6 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
 	return 0;
 }

-#ifdef CONFIG_MMU
-/*
- * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
- * @vms: The vma munmap struct
- * @vmi: The vma iterator
- * @vma: The first vm_area_struct to munmap
- * @start: The aligned start address to munmap
- * @end: The aligned end address to munmap
- * @uf: The userfaultfd list_head
- * @unlock: Unlock after the operation.  Only unlocked on success
- */
-static inline void init_vma_munmap(struct vma_munmap_struct *vms,
-		struct vma_iterator *vmi, struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, struct list_head *uf,
-		bool unlock)
-{
-	vms->vmi = vmi;
-	vms->vma = vma;
-	if (vma) {
-		vms->start = start;
-		vms->end = end;
-	} else {
-		vms->start = vms->end = 0;
-	}
-	vms->unlock = unlock;
-	vms->uf = uf;
-	vms->vma_count = 0;
-	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
-	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
-	vms->unmap_start = FIRST_USER_ADDRESS;
-	vms->unmap_end = USER_PGTABLES_CEILING;
-	vms->clear_ptes = false;
-}
-#endif
-
-int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
-		struct ma_state *mas_detach);
-
-void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
-		struct ma_state *mas_detach);
-
-void vms_clean_up_area(struct vma_munmap_struct *vms,
-		struct ma_state *mas_detach);
-
-/*
- * reattach_vmas() - Undo any munmap work and free resources
- * @mas_detach: The maple state with the detached maple tree
- *
- * Reattach any detached vmas and free up the maple tree used to track the vmas.
- */
-static inline void reattach_vmas(struct ma_state *mas_detach)
-{
-	struct vm_area_struct *vma;
-
-	mas_set(mas_detach, 0);
-	mas_for_each(mas_detach, vma, ULONG_MAX)
-		vma_mark_detached(vma, false);
-
-	__mt_destroy(mas_detach->tree);
-}
-
-/*
- * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
- * operation.
- * @vms: The vma unmap structure
- * @mas_detach: The maple state with the detached maple tree
- *
- * Reattach any detached vmas, free up the maple tree used to track the vmas.
- * If that's not possible because the ptes are cleared (and vm_ops->closed() may
- * have been called), then a NULL is written over the vmas and the vmas are
- * removed (munmap() completed).
- */
-static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
-		struct ma_state *mas_detach)
-{
-	struct ma_state *mas = &vms->vmi->mas;
-	if (!vms->nr_pages)
-		return;
-
-	if (vms->clear_ptes)
-		return reattach_vmas(mas_detach);
-
-	/*
-	 * Aborting cannot just call the vm_ops open() because they are often
-	 * not symmetrical and state data has been lost.  Resort to the old
-	 * failure method of leaving a gap where the MAP_FIXED mapping failed.
-	 */
-	mas_set_range(mas, vms->start, vms->end - 1);
-	mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
-	/* Clean up the insertion of the unfortunate gap */
-	vms_complete_munmap_vmas(vms, mas_detach);
-}
-
 int
 do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
@@ -336,6 +243,10 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
 int mm_take_all_locks(struct mm_struct *mm);
 void mm_drop_all_locks(struct mm_struct *mm);

+unsigned long __mmap_region(struct file *file, unsigned long addr,
+		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+		struct list_head *uf);
+
 static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
 {
 	/*
diff --git a/mm/vma_internal.h b/mm/vma_internal.h
index b930ab12a587..fc5f172a36bd 100644
--- a/mm/vma_internal.h
+++ b/mm/vma_internal.h
@@ -17,8 +17,10 @@
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/huge_mm.h>
+#include <linux/hugetlb.h>
 #include <linux/hugetlb_inline.h>
 #include <linux/kernel.h>
+#include <linux/ksm.h>
 #include <linux/khugepaged.h>
 #include <linux/list.h>
 #include <linux/maple_tree.h>
@@ -32,11 +34,14 @@
 #include <linux/mmu_context.h>
 #include <linux/mutex.h>
 #include <linux/pagemap.h>
+#include <linux/perf_event.h>
 #include <linux/pfn.h>
 #include <linux/rcupdate.h>
 #include <linux/rmap.h>
 #include <linux/rwsem.h>
 #include <linux/sched/signal.h>
+#include <linux/security.h>
+#include <linux/shmem_fs.h>
 #include <linux/swap.h>
 #include <linux/uprobes.h>
 #include <linux/userfaultfd_k.h>
--
2.47.0



* [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region()
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (5 preceding siblings ...)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 6/8] mm: isolate mmap internal logic to mm/vma.c Lorenzo Stoakes
@ 2024-10-22 20:40 ` Lorenzo Stoakes
  2024-10-23 14:38   ` Vlastimil Babka
  2024-10-23 17:19   ` Liam R. Howlett
  2024-10-22 20:40 ` [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs Lorenzo Stoakes
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

We have seen bugs and resource leaks arise from the complexity of the
__mmap_region() function. This, together with its deeply fragile error
handling logic and a general complexity which makes the function difficult
to understand, makes it highly desirable to refactor it into something readable.

Achieve this by separating the function into smaller logical parts which
are easier to understand and follow, and which, importantly, very
significantly simplify the error handling.

Note that we now call vms_abort_munmap_vmas() in more error paths than we
used to; however, in cases where no abort needs to occur, vms->nr_pages will
be equal to zero and we simply exit the function without doing more than we
would have done previously.

Importantly, the invocation of the driver mmap hook via mmap_file() now has
very simple and obvious handling (this was previously the most problematic
part of the mmap() operation).

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.c | 380 +++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 240 insertions(+), 140 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index 7617f9d50d62..a271e2b406ab 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,6 +7,31 @@
 #include "vma_internal.h"
 #include "vma.h"

+struct mmap_state {
+	struct mm_struct *mm;
+	struct vma_iterator *vmi;
+	struct vma_merge_struct *vmg;
+	struct list_head *uf;
+
+	struct vma_munmap_struct vms;
+	struct ma_state mas_detach;
+	struct maple_tree mt_detach;
+
+	unsigned long flags;
+	unsigned long pglen;
+	unsigned long charged;
+};
+
+#define MMAP_STATE(name, mm_, vmi_, vmg_, uf_, flags_, len_)	\
+	struct mmap_state name = {				\
+		.mm = mm_,					\
+		.vmi = vmi_,					\
+		.vmg = vmg_,					\
+		.uf = uf_,					\
+		.flags = flags_,				\
+		.pglen = PHYS_PFN(len_),			\
+	}
+
 static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
 {
 	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
@@ -2169,189 +2194,247 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
 	vms_complete_munmap_vmas(vms, mas_detach);
 }

-unsigned long __mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
+/*
+ * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
+ *                    unmapped once the map operation is completed, check limits,
+ *                    account mapping and clean up any pre-existing VMAs.
+ *
+ * @map: Mapping state.
+ *
+ * Returns: 0 on success, error code otherwise.
+ */
+static int __mmap_prepare(struct mmap_state *map)
 {
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	pgoff_t pglen = PHYS_PFN(len);
-	unsigned long charged = 0;
-	struct vma_munmap_struct vms;
-	struct ma_state mas_detach;
-	struct maple_tree mt_detach;
-	unsigned long end = addr + len;
 	int error;
-	VMA_ITERATOR(vmi, mm, addr);
-	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
-
-	vmg.file = file;
-	/* Find the first overlapping VMA */
-	vma = vma_find(&vmi, end);
-	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
-	if (vma) {
-		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
-		mt_on_stack(mt_detach);
-		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+	struct vma_iterator *vmi = map->vmi;
+	struct vma_merge_struct *vmg = map->vmg;
+	struct vma_munmap_struct *vms = &map->vms;
+
+	/* Find the first overlapping VMA and initialise unmap state. */
+	vms->vma = vma_find(vmi, vmg->end);
+	init_vma_munmap(vms, vmi, vms->vma, vmg->start, vmg->end, map->uf,
+			/* unlock = */ false);
+
+	/* OK, we have overlapping VMAs - prepare to unmap them. */
+	if (vms->vma) {
+		mt_init_flags(&map->mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+		mt_on_stack(map->mt_detach);
+		mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
 		/* Prepare to unmap any existing mapping in the area */
-		error = vms_gather_munmap_vmas(&vms, &mas_detach);
+		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
 		if (error)
-			goto gather_failed;
+			return error;

-		vmg.next = vms.next;
-		vmg.prev = vms.prev;
-		vma = NULL;
+		vmg->next = vms->next;
+		vmg->prev = vms->prev;
 	} else {
-		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
+		vmg->next = vma_iter_next_rewind(vmi, &vmg->prev);
 	}

 	/* Check against address space limit. */
-	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
-		error = -ENOMEM;
-		goto abort_munmap;
-	}
+	if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
+		return -ENOMEM;

-	/*
-	 * Private writable mapping: check memory availability
-	 */
-	if (accountable_mapping(file, vm_flags)) {
-		charged = pglen;
-		charged -= vms.nr_accounted;
-		if (charged) {
-			error = security_vm_enough_memory_mm(mm, charged);
+	/* Private writable mapping: check memory availability. */
+	if (accountable_mapping(vmg->file, map->flags)) {
+		map->charged = map->pglen;
+		map->charged -= vms->nr_accounted;
+		if (map->charged) {
+			error = security_vm_enough_memory_mm(map->mm, map->charged);
 			if (error)
-				goto abort_munmap;
+				return error;
 		}

-		vms.nr_accounted = 0;
-		vm_flags |= VM_ACCOUNT;
-		vmg.flags = vm_flags;
+		vms->nr_accounted = 0;
+		map->flags |= VM_ACCOUNT;
 	}

 	/*
-	 * clear PTEs while the vma is still in the tree so that rmap
+	 * Clear PTEs while the vma is still in the tree so that rmap
 	 * cannot race with the freeing later in the truncate scenario.
 	 * This is also needed for mmap_file(), which is why vm_ops
 	 * close function is called.
 	 */
-	vms_clean_up_area(&vms, &mas_detach);
-	vma = vma_merge_new_range(&vmg);
-	if (vma)
-		goto expanded;
+	vms_clean_up_area(vms, &map->mas_detach);
+
+	return 0;
+}
+
+static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
+			       struct vm_area_struct **mergep)
+{
+	struct vma_iterator *vmi = map->vmi;
+	struct vma_merge_struct *vmg = map->vmg;
+	int error;
+
+	vma->vm_file = get_file(vmg->file);
+	error = mmap_file(vma->vm_file, vma);
+	if (error) {
+		fput(vma->vm_file);
+		vma->vm_file = NULL;
+
+		vma_iter_set(vmi, vma->vm_end);
+		/* Undo any partial mapping done by a device driver. */
+		unmap_region(&vmi->mas, vma, vmg->prev, vmg->next);
+
+		return error;
+	}
+
+	/* Drivers cannot alter the address of the VMA. */
+	WARN_ON_ONCE(vmg->start != vma->vm_start);
+	/*
+	 * Drivers should not permit writability when previously it was
+	 * disallowed.
+	 */
+	VM_WARN_ON_ONCE(map->flags != vma->vm_flags &&
+			!(map->flags & VM_MAYWRITE) &&
+			(vma->vm_flags & VM_MAYWRITE));
+
+	vma_iter_config(vmi, vmg->start, vmg->end);
+	/*
+	 * If flags changed after mmap_file(), we should try merge
+	 * vma again as we may succeed this time.
+	 */
+	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
+		struct vm_area_struct *merge;
+
+		vmg->flags = vma->vm_flags;
+		/* If this fails, state is reset ready for a reattempt. */
+		merge = vma_merge_new_range(vmg);
+
+		if (merge) {
+			/*
+			 * ->mmap() can change vma->vm_file and fput
+			 * the original file. So fput the vma->vm_file
+			 * here or we would add an extra fput for file
+			 * and cause general protection fault
+			 * ultimately.
+			 */
+			fput(vma->vm_file);
+			vm_area_free(vma);
+			vma_iter_free(vmi);
+			*mergep = merge;
+		} else {
+			vma_iter_config(vmi, vmg->start, vmg->end);
+		}
+	}
+
+	map->flags = vma->vm_flags;
+	return 0;
+}
+
+/*
+ * __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
+ *                    possible.
+ *
+ *                    An exception to this is if the mapping is file-backed, and
+ *                    the underlying driver changes the VMA flags, permitting a
+ *                    subsequent merge of the VMA, in which case the returned
+ *                    VMA is one that was merged on a second attempt.
+ *
+ * @map:  Mapping state.
+ * @vmap: Output pointer for the new VMA.
+ *
+ * Returns: Zero on success, or an error.
+ */
+static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
+{
+	struct vma_iterator *vmi = map->vmi;
+	struct vma_merge_struct *vmg = map->vmg;
+	struct vm_area_struct *merge = NULL;
+	int error = 0;
+	struct vm_area_struct *vma;
+
 	/*
 	 * Determine the object being mapped and call the appropriate
 	 * specific mapper. the address has already been validated, but
 	 * not unmapped, but the maps are removed from the list.
 	 */
-	vma = vm_area_alloc(mm);
-	if (!vma) {
-		error = -ENOMEM;
-		goto unacct_error;
-	}
+	vma = vm_area_alloc(map->mm);
+	if (!vma)
+		return -ENOMEM;

-	vma_iter_config(&vmi, addr, end);
-	vma_set_range(vma, addr, end, pgoff);
-	vm_flags_init(vma, vm_flags);
-	vma->vm_page_prot = vm_get_page_prot(vm_flags);
+	vma_iter_config(vmi, vmg->start, vmg->end);
+	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
+	vm_flags_init(vma, map->flags);
+	vma->vm_page_prot = vm_get_page_prot(map->flags);

-	if (vma_iter_prealloc(&vmi, vma)) {
+	if (vma_iter_prealloc(vmi, vma)) {
 		error = -ENOMEM;
 		goto free_vma;
 	}

-	if (file) {
-		vma->vm_file = get_file(file);
-		error = mmap_file(file, vma);
-		if (error)
-			goto unmap_and_free_file_vma;
-
-		/* Drivers cannot alter the address of the VMA. */
-		WARN_ON_ONCE(addr != vma->vm_start);
-		/*
-		 * Drivers should not permit writability when previously it was
-		 * disallowed.
-		 */
-		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
-				!(vm_flags & VM_MAYWRITE) &&
-				(vma->vm_flags & VM_MAYWRITE));
-
-		vma_iter_config(&vmi, addr, end);
-		/*
-		 * If vm_flags changed after mmap_file(), we should try merge
-		 * vma again as we may succeed this time.
-		 */
-		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
-			struct vm_area_struct *merge;
-
-			vmg.flags = vma->vm_flags;
-			/* If this fails, state is reset ready for a reattempt. */
-			merge = vma_merge_new_range(&vmg);
-
-			if (merge) {
-				/*
-				 * ->mmap() can change vma->vm_file and fput
-				 * the original file. So fput the vma->vm_file
-				 * here or we would add an extra fput for file
-				 * and cause general protection fault
-				 * ultimately.
-				 */
-				fput(vma->vm_file);
-				vm_area_free(vma);
-				vma_iter_free(&vmi);
-				vma = merge;
-				/* Update vm_flags to pick up the change. */
-				vm_flags = vma->vm_flags;
-				goto file_expanded;
-			}
-			vma_iter_config(&vmi, addr, end);
-		}
-
-		vm_flags = vma->vm_flags;
-	} else if (vm_flags & VM_SHARED) {
+	if (vmg->file)
+		error = __mmap_new_file_vma(map, vma, &merge);
+	else if (map->flags & VM_SHARED)
 		error = shmem_zero_setup(vma);
-		if (error)
-			goto free_iter_vma;
-	} else {
+	else
 		vma_set_anonymous(vma);
-	}
+
+	if (error)
+		goto free_iter_vma;
+
+	if (merge)
+		goto file_expanded;

 #ifdef CONFIG_SPARC64
 	/* TODO: Fix SPARC ADI! */
-	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+	WARN_ON_ONCE(!arch_validate_flags(map->flags));
 #endif

 	/* Lock the VMA since it is modified after insertion into VMA tree */
 	vma_start_write(vma);
-	vma_iter_store(&vmi, vma);
-	mm->map_count++;
+	vma_iter_store(vmi, vma);
+	map->mm->map_count++;
 	vma_link_file(vma);

 	/*
 	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
 	 * call covers the non-merge case.
 	 */
-	khugepaged_enter_vma(vma, vma->vm_flags);
+	khugepaged_enter_vma(vma, map->flags);

 file_expanded:
-	file = vma->vm_file;
 	ksm_add_vma(vma);
-expanded:
+
+	*vmap = vma;
+	return 0;
+
+free_iter_vma:
+	vma_iter_free(vmi);
+free_vma:
+	vm_area_free(vma);
+	return error;
+}
+
+/*
+ * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
+ *                     statistics, handle locking and finalise the VMA.
+ *
+ * @map: Mapping state.
+ * @vma: Merged or newly allocated VMA for the mmap()'d region.
+ */
+static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = map->mm;
+	unsigned long vm_flags = vma->vm_flags;
+
 	perf_event_mmap(vma);

-	/* Unmap any existing mapping in the area */
-	vms_complete_munmap_vmas(&vms, &mas_detach);
+	/* Unmap any existing mapping in the area. */
+	vms_complete_munmap_vmas(&map->vms, &map->mas_detach);

-	vm_stat_account(mm, vm_flags, pglen);
+	vm_stat_account(mm, vma->vm_flags, map->pglen);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
 					is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm))
+					vma == get_gate_vma(mm))
 			vm_flags_clear(vma, VM_LOCKED_MASK);
 		else
-			mm->locked_vm += pglen;
+			mm->locked_vm += map->pglen;
 	}

-	if (file)
+	if (vma->vm_file)
 		uprobe_mmap(vma);

 	/*
@@ -2364,26 +2447,43 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
 	vm_flags_set(vma, VM_SOFTDIRTY);

 	vma_set_page_prot(vma);
+}

-	return addr;
+unsigned long __mmap_region(struct file *file, unsigned long addr,
+		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+		struct list_head *uf)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	int error;
+	VMA_ITERATOR(vmi, mm, addr);
+	VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, pgoff);
+	MMAP_STATE(map, mm, &vmi, &vmg, uf, vm_flags, len);

-unmap_and_free_file_vma:
-	fput(vma->vm_file);
-	vma->vm_file = NULL;
+	vmg.file = file;

-	vma_iter_set(&vmi, vma->vm_end);
-	/* Undo any partial mapping done by a device driver. */
-	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
-free_iter_vma:
-	vma_iter_free(&vmi);
-free_vma:
-	vm_area_free(vma);
-unacct_error:
-	if (charged)
-		vm_unacct_memory(charged);
+	error = __mmap_prepare(&map);
+	if (error)
+		goto abort_munmap;
+
+	/* Attempt to merge with adjacent VMAs... */
+	vmg.flags = map.flags;
+	vma = vma_merge_new_range(&vmg);
+	if (!vma) {
+		/* ...but if we can't, allocate a new VMA. */
+		error = __mmap_new_vma(&map, &vma);
+		if (error)
+			goto unacct_error;
+	}
+
+	__mmap_complete(&map, vma);

+	return addr;
+
+unacct_error:
+	if (map.charged)
+		vm_unacct_memory(map.charged);
 abort_munmap:
-	vms_abort_munmap_vmas(&vms, &mas_detach);
-gather_failed:
+	vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
 	return error;
 }
--
2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (6 preceding siblings ...)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
@ 2024-10-22 20:40 ` Lorenzo Stoakes
  2024-10-23 15:01   ` Vlastimil Babka
  2024-10-22 20:58 ` [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

Previously, we'd always try to merge a file-backed VMA if its flags were
changed by the driver.

This, however, is rarely meaningful, as typically the flags would be changed
to VM_PFNMAP or other VM_SPECIAL flags which are inherently unmergeable.

In cases where it is meaningful (for instance DAX), it is doubtful that this
optimisation is worth the effort and the maintenance risk of having to unwind
state and perform a merge.

Since we've observed bugs and resource leaks due to complexity in this
area, it is simply not acceptable to have a 'nice to have' optimisation
like this complicating an already very complicated code path, so let's
eliminate it.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.c | 39 ++-------------------------------------
 1 file changed, 2 insertions(+), 37 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index a271e2b406ab..fe1fe5099e78 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2260,8 +2260,7 @@ static int __mmap_prepare(struct mmap_state *map)
 	return 0;
 }

-static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
-			       struct vm_area_struct **mergep)
+static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma)
 {
 	struct vma_iterator *vmi = map->vmi;
 	struct vma_merge_struct *vmg = map->vmg;
@@ -2291,34 +2290,6 @@ static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vm
 			(vma->vm_flags & VM_MAYWRITE));

 	vma_iter_config(vmi, vmg->start, vmg->end);
-	/*
-	 * If flags changed after mmap_file(), we should try merge
-	 * vma again as we may succeed this time.
-	 */
-	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
-		struct vm_area_struct *merge;
-
-		vmg->flags = vma->vm_flags;
-		/* If this fails, state is reset ready for a reattempt. */
-		merge = vma_merge_new_range(vmg);
-
-		if (merge) {
-			/*
-			 * ->mmap() can change vma->vm_file and fput
-			 * the original file. So fput the vma->vm_file
-			 * here or we would add an extra fput for file
-			 * and cause general protection fault
-			 * ultimately.
-			 */
-			fput(vma->vm_file);
-			vm_area_free(vma);
-			vma_iter_free(vmi);
-			*mergep = merge;
-		} else {
-			vma_iter_config(vmi, vmg->start, vmg->end);
-		}
-	}
-
 	map->flags = vma->vm_flags;
 	return 0;
 }
@@ -2341,7 +2312,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 {
 	struct vma_iterator *vmi = map->vmi;
 	struct vma_merge_struct *vmg = map->vmg;
-	struct vm_area_struct *merge = NULL;
 	int error = 0;
 	struct vm_area_struct *vma;

@@ -2365,7 +2335,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	}

 	if (vmg->file)
-		error = __mmap_new_file_vma(map, vma, &merge);
+		error = __mmap_new_file_vma(map, vma);
 	else if (map->flags & VM_SHARED)
 		error = shmem_zero_setup(vma);
 	else
@@ -2374,9 +2344,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	if (error)
 		goto free_iter_vma;

-	if (merge)
-		goto file_expanded;
-
 #ifdef CONFIG_SPARC64
 	/* TODO: Fix SPARC ADI! */
 	WARN_ON_ONCE(!arch_validate_flags(map->flags));
@@ -2393,8 +2360,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 	 * call covers the non-merge case.
 	 */
 	khugepaged_enter_vma(vma, map->flags);
-
-file_expanded:
 	ksm_add_vma(vma);

 	*vmap = vma;
--
2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (7 preceding siblings ...)
  2024-10-22 20:40 ` [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs Lorenzo Stoakes
@ 2024-10-22 20:58 ` Lorenzo Stoakes
  2024-10-23  8:37 ` Vlastimil Babka
  2024-10-23 10:22 ` Andrew Morton
  10 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-22 20:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Tue, Oct 22, 2024 at 09:40:51PM +0100, Lorenzo Stoakes wrote:
> The mmap_region() function is somewhat terrifying, with spaghetti-like
> control flow and numerous means by which issues can arise and incomplete
> state, memory leaks and other unpleasantness can occur.
>
> A large amount of the complexity arises from trying to handle errors late
> in the process of mapping a VMA, which forms the basis of recently observed
> issues with resource leaks and observable inconsistent state.
>
> This series goes to great lengths to simplify how mmap_region() works and
> to avoid unwinding errors late on in the process of setting up the VMA for
> the new mapping, and equally avoids such operations occurring while the VMA
> is in an inconsistent state.
>
> The first four patches are intended for backporting to correct the
> possibility of people encountering corrupted state while invoking mmap()
> which is otherwise at risk of happening.
>
> After this we go further, refactoring the code, placing it in mm/vma.c in
> order to make it eventually userland testable, and significantly
> simplifying the logic to avoid this issue arising in future.
>
> Lorenzo Stoakes (8):
>   mm: avoid unsafe VMA hook invocation when error arises on mmap hook
>   mm: unconditionally close VMAs on error
>   mm: refactor map_deny_write_exec()
>   mm: resolve faulty mmap_region() error path behaviour
>   tools: testing: add additional vma_internal.h stubs
>   mm: insolate mmap internal logic to mm/vma.c
>   mm: refactor __mmap_region()
>   mm: do not attempt second merge for file-backed VMAs
>
>  include/linux/mman.h             |  21 +-
>  mm/internal.h                    |  44 ++++
>  mm/mmap.c                        | 262 ++------------------
>  mm/mprotect.c                    |   2 +-
>  mm/nommu.c                       |   7 +-
>  mm/vma.c                         | 403 ++++++++++++++++++++++++++++++-
>  mm/vma.h                         | 101 +-------
>  mm/vma_internal.h                |   5 +
>  tools/testing/vma/vma_internal.h | 106 +++++++-
>  9 files changed, 591 insertions(+), 360 deletions(-)
>
> --
> 2.47.0

To clarify - this series applies against Andrew's tree [0] on the
mm-hotfixes-unstable branch, so that it applies on top of the outstanding
hotfix patches for 6.12.

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  2024-10-22 20:40 ` [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
@ 2024-10-22 21:14   ` Jann Horn
  2024-10-23 16:56     ` Lorenzo Stoakes
  2024-10-23  9:11   ` Vlastimil Babka
  2024-10-23 14:22   ` Liam R. Howlett
  2 siblings, 1 reply; 46+ messages in thread
From: Jann Horn @ 2024-10-22 21:14 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Tue, Oct 22, 2024 at 10:41 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> After an attempted mmap() fails, we are no longer in a situation where we
> can safely interact with VMA hooks. This is currently not enforced, meaning
> that we need complicated handling to ensure we do not incorrectly call
> these hooks.
>
> We can avoid the whole issue by treating the VMA as suspect the moment that
> the file->f_ops->mmap() function reports an error by replacing whatever VMA
> operations were installed with a dummy empty set of VMA operations.
>
> We do so through a new helper function internal to mm - mmap_file() - which
> is both more logically named than the existing call_mmap() function and
> correctly isolates handling of the vm_op reassignment to mm.
>
> All the existing invocations of call_mmap() outside of mm are ultimately
> nested within the call_mmap() from mm, which we now replace.
>
> It is therefore safe to leave call_mmap() in place as a convenience
> function (and to avoid churn). The invokers are:
>
>      ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
>     coda_file_operations -> mmap -> coda_file_mmap()
>      shm_file_operations -> shm_mmap()
> shm_file_operations_huge -> shm_mmap()
>             dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
>                             -> i915_gem_dmabuf_mmap()
>
> None of these callers interact with vm_ops or mappings in a problematic way
> on error, quickly exiting out.
>
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")

(I guess the "Fixes" tag here is a little unconventional in that it
doesn't actually point at the commit introducing the issue that this
commit describes, but it does mark to where the fix should be
backported, so I guess it makes sense and I don't have any better
suggestion.)

> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Jann Horn <jannh@google.com>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error
  2024-10-22 20:40 ` [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
@ 2024-10-22 21:15   ` Jann Horn
  2024-10-23 17:00     ` Lorenzo Stoakes
  2024-10-23  9:24   ` Vlastimil Babka
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 46+ messages in thread
From: Jann Horn @ 2024-10-22 21:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Tue, Oct 22, 2024 at 10:41 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Incorrect invocation of VMA callbacks when the VMA is no longer in a
> consistent state is bug prone and risky to perform.
>
> With regards to the important vm_ops->close() callback We have gone to
> great lengths to try to track whether or not we ought to close VMAs.
>
> Rather than doing so and risking making a mistake somewhere, instead
> unconditionally close and reset vma->vm_ops to an empty dummy operations
> set with a NULL .close operator.
>
> We introduce a new function to do so - vma_close() - and simplify existing
> vms logic which tracked whether we needed to close or not.
>
> This simplifies the logic, avoids incorrect double-calling of the .close()
> callback and allows us to update error paths to simply call vma_close()
> unconditionally - making VMA closure idempotent.
>
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Jann Horn <jannh@google.com>

[...]
> diff --git a/mm/vma.h b/mm/vma.h
> index 55457cb68200..75558b5e9c8c 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -42,7 +42,6 @@ struct vma_munmap_struct {
>         int vma_count;                  /* Number of vmas that will be removed */
>         bool unlock;                    /* Unlock after the munmap */
>         bool clear_ptes;                /* If there are outstanding PTE to be cleared */
> -       bool closed_vm_ops;             /* call_mmap() was encountered, so vmas may be closed */
>         /* 1 byte hole */

nit: outdated comment, this hole is 2 bytes now
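
(i.e., with closed_vm_ops gone, presumably something like:

	bool unlock;		/* Unlock after the munmap */
	bool clear_ptes;	/* If there are outstanding PTE to be cleared */
	/* 2 byte hole */

though the exact padding of course depends on the final layout.)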



>         unsigned long nr_pages;         /* Number of pages being removed */
>         unsigned long locked_vm;        /* Number of locked pages */


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec()
  2024-10-22 20:40 ` [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
@ 2024-10-22 21:15   ` Jann Horn
  2024-10-23 16:35     ` Lorenzo Stoakes
  2024-10-23  9:57   ` Vlastimil Babka
  2024-10-23 14:30   ` Liam R. Howlett
  2 siblings, 1 reply; 46+ messages in thread
From: Jann Horn @ 2024-10-22 21:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Tue, Oct 22, 2024 at 10:41 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Refactor the map_deny_write_exec() to not unnecessarily require a VMA
> parameter but rather to accept VMA flags parameters, which allows us to use
> this function early in mmap_region() in a subsequent commit.
>
> While we're here, we refactor the function to be more readable and add some
> additional documentation.
>
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Jann Horn <jannh@google.com>

[..]
> -static inline bool map_deny_write_exec(struct vm_area_struct *vma,  unsigned long vm_flags)
> +static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
>  {
> +       /* If MDWE is disabled, we have nothing to deny. */
>         if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
>                 return false;
>
> -       if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
> +       /* If the new VMA is not executable, we have nothing to deny. */
> +       if (!(new & VM_EXEC))
> +               return false;
> +
> +       /* Under MDWE we absolutely do not accept writably executable... */
> +       if (new & VM_WRITE)
>                 return true;
>
> -       if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
> +       /* ...nor newly executable VMAs. */

nit: maybe clarify this as "nor existing VMAs newly becoming
executable" or something like that


> +       if (!(old & VM_EXEC))
>                 return true;
>
>         return false;


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (8 preceding siblings ...)
  2024-10-22 20:58 ` [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
@ 2024-10-23  8:37 ` Vlastimil Babka
  2024-10-23  8:45   ` Lorenzo Stoakes
  2024-10-23 10:22 ` Andrew Morton
  10 siblings, 1 reply; 46+ messages in thread
From: Vlastimil Babka @ 2024-10-23  8:37 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
	Linus Torvalds, Peter Xu

On 10/22/24 22:40, Lorenzo Stoakes wrote:
> The mmap_region() function is somewhat terrifying, with spaghetti-like
> control flow and numerous means by which issues can arise and incomplete
> state, memory leaks and other unpleasantness can occur.
> 
> A large amount of the complexity arises from trying to handle errors late
> in the process of mapping a VMA, which forms the basis of recently observed
> issues with resource leaks and observable inconsistent state.
> 
> This series goes to great lengths to simplify how mmap_region() works and
> to avoid unwinding errors late on in the process of setting up the VMA for
> the new mapping, and equally avoids such operations occurring while the VMA
> is in an inconsistent state.
> 
> The first four patches are intended for backporting to correct the
> possibility of people encountering corrupted state while invoking mmap()
> which is otherwise at risk of happening.
> 
> After this we go further, refactoring the code, placing it in mm/vma.c in
> order to make it eventually userland testable, and significantly
> simplifying the logic to avoid this issue arising in future.

Are the latter 4 patches thus also intended as hotfix for 6.12, or was it
just due to git applying the same PATCH prefix to the whole series, and the
real intention is 6.13?

> Lorenzo Stoakes (8):
>   mm: avoid unsafe VMA hook invocation when error arises on mmap hook
>   mm: unconditionally close VMAs on error
>   mm: refactor map_deny_write_exec()
>   mm: resolve faulty mmap_region() error path behaviour
>   tools: testing: add additional vma_internal.h stubs
>   mm: insolate mmap internal logic to mm/vma.c
>   mm: refactor __mmap_region()
>   mm: do not attempt second merge for file-backed VMAs
> 
>  include/linux/mman.h             |  21 +-
>  mm/internal.h                    |  44 ++++
>  mm/mmap.c                        | 262 ++------------------
>  mm/mprotect.c                    |   2 +-
>  mm/nommu.c                       |   7 +-
>  mm/vma.c                         | 403 ++++++++++++++++++++++++++++++-
>  mm/vma.h                         | 101 +-------
>  mm/vma_internal.h                |   5 +
>  tools/testing/vma/vma_internal.h | 106 +++++++-
>  9 files changed, 591 insertions(+), 360 deletions(-)
> 
> --
> 2.47.0



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor
  2024-10-23  8:37 ` Vlastimil Babka
@ 2024-10-23  8:45   ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23  8:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R . Howlett, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 10:37:04AM +0200, Vlastimil Babka wrote:
> On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > The mmap_region() function is somewhat terrifying, with spaghetti-like
> > control flow and numerous means by which issues can arise and incomplete
> > state, memory leaks and other unpleasantness can occur.
> >
> > A large amount of the complexity arises from trying to handle errors late
> > in the process of mapping a VMA, which forms the basis of recently observed
> > issues with resource leaks and observable inconsistent state.
> >
> > This series goes to great lengths to simplify how mmap_region() works and
> > to avoid unwinding errors late on in the process of setting up the VMA for
> > the new mapping, and equally avoids such operations occurring while the VMA
> > is in an inconsistent state.
> >
> > The first four patches are intended for backporting to correct the
> > possibility of people encountering corrupted state while invoking mmap()
> > which is otherwise at risk of happening.
> >
> > After this we go further, refactoring the code, placing it in mm/vma.c in
> > order to make it eventually userland testable, and significantly
> > simplifying the logic to avoid this issue arising in future.
>
> Are the latter 4 patches thus also intended as hotfix for 6.12, or was it
> just due to git applying the same PATCH prefix to the whole series, and the
> real intention is 6.13?

They are intended as hotfixes as they are in the same vein, so one could
argue we should hotfix the lot and be done with it... but at the same time I
realise it's a bit of a stretch, so I'm happy to split the last four patches
out for 6.13.

The issue is expressing that the last 4 have a dependency on the first 4,
and making sure this doesn't get lost in the mix...

To be clear though, the first 4 do very much _have_ to be hotfixes and ones
which are backported to stable.

Andrew - we can do whatever makes sense to you here.

Thanks!

>
> > Lorenzo Stoakes (8):
> >   mm: avoid unsafe VMA hook invocation when error arises on mmap hook
> >   mm: unconditionally close VMAs on error
> >   mm: refactor map_deny_write_exec()
> >   mm: resolve faulty mmap_region() error path behaviour
> >   tools: testing: add additional vma_internal.h stubs
> >   mm: insolate mmap internal logic to mm/vma.c
> >   mm: refactor __mmap_region()
> >   mm: do not attempt second merge for file-backed VMAs
> >
> >  include/linux/mman.h             |  21 +-
> >  mm/internal.h                    |  44 ++++
> >  mm/mmap.c                        | 262 ++------------------
> >  mm/mprotect.c                    |   2 +-
> >  mm/nommu.c                       |   7 +-
> >  mm/vma.c                         | 403 ++++++++++++++++++++++++++++++-
> >  mm/vma.h                         | 101 +-------
> >  mm/vma_internal.h                |   5 +
> >  tools/testing/vma/vma_internal.h | 106 +++++++-
> >  9 files changed, 591 insertions(+), 360 deletions(-)
> >
> > --
> > 2.47.0
>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  2024-10-22 20:40 ` [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
  2024-10-22 21:14   ` Jann Horn
@ 2024-10-23  9:11   ` Vlastimil Babka
  2024-10-23 14:22   ` Liam R. Howlett
  2 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2024-10-23  9:11 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
	Linus Torvalds, Peter Xu

On 10/22/24 22:40, Lorenzo Stoakes wrote:
> After an attempted mmap() fails, we are no longer in a situation where we
> can safely interact with VMA hooks. This is currently not enforced, meaning
> that we need complicated handling to ensure we do not incorrectly call
> these hooks.
> 
> We can avoid the whole issue by treating the VMA as suspect the moment that
> the file->f_ops->mmap() function reports an error by replacing whatever VMA
> operations were installed with a dummy empty set of VMA operations.
> 
> We do so through a new helper function internal to mm - mmap_file() - which
> is both more logically named than the existing call_mmap() function and
> correctly isolates handling of the vm_op reassignment to mm.
> 
> All the existing invocations of call_mmap() outside of mm are ultimately
> nested within the call_mmap() from mm, which we now replace.
> 
> It is therefore safe to leave call_mmap() in place as a convenience
> function (and to avoid churn). The invokers are:
> 
>      ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
>     coda_file_operations -> mmap -> coda_file_mmap()
>      shm_file_operations -> shm_mmap()
> shm_file_operations_huge -> shm_mmap()
>             dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
> 	                    -> i915_gem_dmabuf_mmap()
> 
> None of these callers interact with vm_ops or mappings in a problematic way
> on error, quickly exiting out.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error
  2024-10-22 20:40 ` [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
  2024-10-22 21:15   ` Jann Horn
@ 2024-10-23  9:24   ` Vlastimil Babka
  2024-10-23 16:58     ` Lorenzo Stoakes
  2024-10-23 14:26   ` Liam R. Howlett
  2024-10-23 14:41   ` Liam R. Howlett
  3 siblings, 1 reply; 46+ messages in thread
From: Vlastimil Babka @ 2024-10-23  9:24 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
	Linus Torvalds, Peter Xu

On 10/22/24 22:40, Lorenzo Stoakes wrote:
> Incorrect invocation of VMA callbacks when the VMA is no longer in a
> consistent state is bug prone and risky to perform.
> 
> With regards to the important vm_ops->close() callback We have gone to
> great lengths to try to track whether or not we ought to close VMAs.
> 
> Rather than doing so and risking making a mistake somewhere, instead
> unconditionally close and reset vma->vm_ops to an empty dummy operations
> set with a NULL .close operator.
> 
> We introduce a new function to do so - vma_close() - and simplify existing
> vms logic which tracked whether we needed to close or not.
> 
> This simplifies the logic, avoids incorrect double-calling of the .close()
> callback and allows us to update error paths to simply call vma_close()
> unconditionally - making VMA closure idempotent.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Nice simplification. Nit below.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> +/*
> + * Unconditionally close the VMA if it has a close hook and prevent hooks from
> + * being invoked after close. VMA hooks are mutated.
> + */
> +static inline void vma_close(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_ops && vma->vm_ops->close) {
> +		vma->vm_ops->close(vma);
> +
> +		/*
> +		 * The mapping is in an inconsistent state, and no further hooks
> +		 * may be invoked upon it.
> +		 */
> +		vma->vm_ops = &vma_dummy_vm_ops;
> +	}

Nit: if we want to "prevent hooks" as in "any hooks" then we should be
replacing existing vm_ops even if it has no close hook? If it's enough to
prevent further close() hooks (as commit log suggests) then the
implementation is fine but the comment might be misleading.
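
Something like the below (untested sketch, and note it would also install
dummy ops on VMAs which never had any vm_ops, which may or may not be
desirable) would cover the "any hooks" reading:

static inline void vma_close(struct vm_area_struct *vma)
{
	if (vma->vm_ops && vma->vm_ops->close)
		vma->vm_ops->close(vma);

	/*
	 * The mapping may be in an inconsistent state, and no further hooks
	 * of any kind may be invoked upon it.
	 */
	vma->vm_ops = &vma_dummy_vm_ops;
}

Otherwise, just rewording the comment to say "prevent further close()
invocations" would address the nit.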

> +}
> +
>  #ifdef CONFIG_MMU
> 
>  /* Flags for folio_pte_batch(). */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 10f4ccaf491b..d55c58e99a54 100644


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec()
  2024-10-22 20:40 ` [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
  2024-10-22 21:15   ` Jann Horn
@ 2024-10-23  9:57   ` Vlastimil Babka
  2024-10-23 14:30   ` Liam R. Howlett
  2 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2024-10-23  9:57 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
	Linus Torvalds, Peter Xu

On 10/22/24 22:40, Lorenzo Stoakes wrote:
> Refactor the map_deny_write_exec() to not unnecessarily require a VMA
> parameter but rather to accept VMA flags parameters, which allows us to use
> this function early in mmap_region() in a subsequent commit.
> 
> While we're here, we refactor the function to be more readable and add some
> additional documentation.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor
  2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
                   ` (9 preceding siblings ...)
  2024-10-23  8:37 ` Vlastimil Babka
@ 2024-10-23 10:22 ` Andrew Morton
  2024-10-23 10:25   ` Andrew Morton
  2024-10-23 10:26   ` Lorenzo Stoakes
  10 siblings, 2 replies; 46+ messages in thread
From: Andrew Morton @ 2024-10-23 10:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Tue, 22 Oct 2024 21:40:51 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:

> After this we go further, refactoring the code, placing it in mm/vma.c in
> order to make it eventually userland testable, and significantly
> simplifying the logic to avoid this issue arising in future.

I'm not sure that patches 4-8 are strictly 6.12 material.  What's the
thinking here?



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor
  2024-10-23 10:22 ` Andrew Morton
@ 2024-10-23 10:25   ` Andrew Morton
  2024-10-23 10:26   ` Lorenzo Stoakes
  1 sibling, 0 replies; 46+ messages in thread
From: Andrew Morton @ 2024-10-23 10:25 UTC (permalink / raw)
  To: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	linux-kernel, linux-mm, Linus Torvalds, Peter Xu

On Wed, 23 Oct 2024 03:22:36 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 22 Oct 2024 21:40:51 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> 
> > After this we go further, refactoring the code, placing it in mm/vma.c in
> > order to make it eventually userland testable, and significantly
> > simplifying the logic to avoid this issue arising in future.
> 
> I'm not sure that patches 4-8 are strictly 6.12 material.  What's the
> thinking here?
> 

A downside to merging all 8 patches into 6.12-rcX is that when -stable
takes patches 1-4, they will have had negligible testing without the
presence of patches 5-8.  Patches 5-8 might accidentally fix
something..


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor
  2024-10-23 10:22 ` Andrew Morton
  2024-10-23 10:25   ` Andrew Morton
@ 2024-10-23 10:26   ` Lorenzo Stoakes
  1 sibling, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 10:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 03:22:36AM -0700, Andrew Morton wrote:
> On Tue, 22 Oct 2024 21:40:51 +0100 Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>
> > After this we go further, refactoring the code, placing it in mm/vma.c in
> > order to make it eventually userland testable, and significantly
> > simplifying the logic to avoid this issue arising in future.
>
> I'm not sure that patches 4-8 are strictly 6.12 material.  What's the
> thinking here?
>

Patch 4 is absolutely critical, and without it patches 1-3 don't really
matter all that much. It also needs to be backported (I will be doing some
manual backports for that once landed).

Patches 5-8 could be held back to 6.13 if preferred; while they are nice to
have, as they do further work to eliminate this class of bugs, they are not
as important.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour
  2024-10-22 20:40 ` [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
@ 2024-10-23 12:58   ` Vlastimil Babka
  2024-10-23 14:20     ` Liam R. Howlett
  2024-10-23 15:02   ` Liam R. Howlett
  1 sibling, 1 reply; 46+ messages in thread
From: Vlastimil Babka @ 2024-10-23 12:58 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
	Linus Torvalds, Peter Xu

On 10/22/24 22:40, Lorenzo Stoakes wrote:
> The mmap_region() function is somewhat terrifying, with spaghetti-like
> control flow and numerous means by which issues can arise and incomplete
> state, memory leaks and other unpleasantness can occur.
> 
> A large amount of the complexity arises from trying to handle errors late
> in the process of mapping a VMA, which forms the basis of recently observed
> issues with resource leaks and observable inconsistent state.
> 
> Taking advantage of previous patches in this series we move a number of
> checks earlier in the code, simplifying things by moving the core of the
> logic into a static internal function __mmap_region().
> 
> Doing this allows us to perform a number of checks up front before we do
> any real work, and allows us to unwind the writable unmap check
> unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
> validation unconditionally also.
> 
> We move a number of things here:
> 
> 1. We preallocate memory for the iterator before we call the file-backed
>    memory hook, allowing us to exit early and avoid having to perform
>    complicated and error-prone close/free logic. We carefully free
>    iterator state on both success and error paths.
> 
> 2. The enclosing mmap_region() function handles the mapping_map_writable()
>    logic early. Previously the logic had the mapping_map_writable() at the
>    point of mapping a newly allocated file-backed VMA, and a matching
>    mapping_unmap_writable() on success and error paths.
> 
>    We now do this unconditionally if this is a file-backed, shared writable
>    mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however
>    doing so does not invalidate the seal check we just performed, and we in
>    any case always decrement the counter in the wrapper.
> 
>    We perform a debug assert to ensure a driver does not attempt to do the
>    opposite.
> 
> 3. We also move arch_validate_flags() up into the mmap_region()
>    function. This is only relevant on arm64 and sparc64, and the check is
>    only meaningful for SPARC with ADI enabled. We explicitly add a warning
>    for this arch if a driver invalidates this check, though the code ought
>    eventually to be fixed to eliminate the need for this.
> 
> With all of these measures in place, we no longer need to explicitly close
> the VMA on error paths, as we place all checks which might fail prior to a
> call to any driver mmap hook.
> 
> This eliminates an entire class of errors, makes the code easier to reason
> about and more robust.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

some nits below

> ---
>  mm/mmap.c | 120 ++++++++++++++++++++++++++++++------------------------
>  1 file changed, 66 insertions(+), 54 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 66edf0ebba94..7d02b47a1895 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1361,20 +1361,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
>  }
> 
> -unsigned long mmap_region(struct file *file, unsigned long addr,
> +static unsigned long __mmap_region(struct file *file, unsigned long addr,
>  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>  		struct list_head *uf)
>  {
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = NULL;
>  	pgoff_t pglen = PHYS_PFN(len);
> -	struct vm_area_struct *merge;
>  	unsigned long charged = 0;
>  	struct vma_munmap_struct vms;
>  	struct ma_state mas_detach;
>  	struct maple_tree mt_detach;
>  	unsigned long end = addr + len;
> -	bool writable_file_mapping = false;
>  	int error;
>  	VMA_ITERATOR(vmi, mm, addr);
>  	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> @@ -1448,28 +1446,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	vm_flags_init(vma, vm_flags);
>  	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> 
> +	if (vma_iter_prealloc(&vmi, vma)) {
> +		error = -ENOMEM;
> +		goto free_vma;
> +	}
> +
>  	if (file) {
>  		vma->vm_file = get_file(file);
>  		error = mmap_file(file, vma);
>  		if (error)
> -			goto unmap_and_free_vma;
> -
> -		if (vma_is_shared_maywrite(vma)) {
> -			error = mapping_map_writable(file->f_mapping);
> -			if (error)
> -				goto close_and_free_vma;
> -
> -			writable_file_mapping = true;
> -		}
> +			goto unmap_and_free_file_vma;
> 
> +		/* Drivers cannot alter the address of the VMA. */
> +		WARN_ON_ONCE(addr != vma->vm_start);
>  		/*
> -		 * Expansion is handled above, merging is handled below.
> -		 * Drivers should not alter the address of the VMA.
> +		 * Drivers should not permit writability when previously it was
> +		 * disallowed.
>  		 */
> -		if (WARN_ON((addr != vma->vm_start))) {
> -			error = -EINVAL;
> -			goto close_and_free_vma;
> -		}
> +		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> +				!(vm_flags & VM_MAYWRITE) &&
> +				(vma->vm_flags & VM_MAYWRITE));
> 
>  		vma_iter_config(&vmi, addr, end);

I wonder if this one could be removed - earlier above we did the same config
and neither parameter has changed? But it was true before this patch as well,
and maybe it's refactored away later in the series; just noting.

>  		/*
> @@ -1477,6 +1473,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  		 * vma again as we may succeed this time.
>  		 */
>  		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> +			struct vm_area_struct *merge;
> +
>  			vmg.flags = vma->vm_flags;
>  			/* If this fails, state is reset ready for a reattempt. */
>  			merge = vma_merge_new_range(&vmg);
> @@ -1491,10 +1489,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  				 */
>  				fput(vma->vm_file);
>  				vm_area_free(vma);
> +				vma_iter_free(&vmi);

If we merged successfully, I think this is not necessary? But doesn't hurt?

>  				vma = merge;
>  				/* Update vm_flags to pick up the change. */
>  				vm_flags = vma->vm_flags;
> -				goto unmap_writable;
> +				goto file_expanded;
>  			}
>  			vma_iter_config(&vmi, addr, end);
>  		}
> @@ -1503,26 +1502,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	} else if (vm_flags & VM_SHARED) {
>  		error = shmem_zero_setup(vma);
>  		if (error)
> -			goto free_vma;
> +			goto free_iter_vma;
>  	} else {
>  		vma_set_anonymous(vma);
>  	}
> 
> -	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
> -		error = -EACCES;
> -		goto close_and_free_vma;
> -	}
> -
> -	/* Allow architectures to sanity-check the vm_flags */
> -	if (!arch_validate_flags(vma->vm_flags)) {
> -		error = -EINVAL;
> -		goto close_and_free_vma;
> -	}
> -
> -	if (vma_iter_prealloc(&vmi, vma)) {
> -		error = -ENOMEM;
> -		goto close_and_free_vma;
> -	}
> +#ifdef CONFIG_SPARC64
> +	/* TODO: Fix SPARC ADI! */
> +	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> +#endif
> 
>  	/* Lock the VMA since it is modified after insertion into VMA tree */
>  	vma_start_write(vma);
> @@ -1536,10 +1524,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	 */
>  	khugepaged_enter_vma(vma, vma->vm_flags);
> 
> -	/* Once vma denies write, undo our temporary denial count */
> -unmap_writable:
> -	if (writable_file_mapping)
> -		mapping_unmap_writable(file->f_mapping);
> +file_expanded:
>  	file = vma->vm_file;
>  	ksm_add_vma(vma);
>  expanded:
> @@ -1572,23 +1557,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> 
>  	vma_set_page_prot(vma);
> 
> -	validate_mm(mm);
>  	return addr;
> 
> -close_and_free_vma:
> -	vma_close(vma);
> -
> -	if (file || vma->vm_file) {
> -unmap_and_free_vma:
> -		fput(vma->vm_file);
> -		vma->vm_file = NULL;
> +unmap_and_free_file_vma:
> +	fput(vma->vm_file);
> +	vma->vm_file = NULL;
> 
> -		vma_iter_set(&vmi, vma->vm_end);
> -		/* Undo any partial mapping done by a device driver. */
> -		unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> -	}
> -	if (writable_file_mapping)
> -		mapping_unmap_writable(file->f_mapping);
> +	vma_iter_set(&vmi, vma->vm_end);
> +	/* Undo any partial mapping done by a device driver. */
> +	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> +free_iter_vma:
> +	vma_iter_free(&vmi);
>  free_vma:
>  	vm_area_free(vma);
>  unacct_error:
> @@ -1598,10 +1577,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  abort_munmap:
>  	vms_abort_munmap_vmas(&vms, &mas_detach);
>  gather_failed:
> -	validate_mm(mm);
>  	return error;
>  }
> 
> +unsigned long mmap_region(struct file *file, unsigned long addr,
> +			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> +			  struct list_head *uf)
> +{
> +	unsigned long ret;
> +	bool writable_file_mapping = false;
> +
> +	/* Allow architectures to sanity-check the vm_flags. */
> +	if (!arch_validate_flags(vm_flags))
> +		return -EINVAL;
> +
> +	/* Check to see if MDWE is applicable. */
> +	if (map_deny_write_exec(vm_flags, vm_flags))
> +		return -EACCES;

The two checks above used to be in the opposite order. Can we keep that just
to be sure we don't change user-observable behavior unnecessarily?
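
i.e. (just swapping the two checks back, as a sketch):

	/* Check to see if MDWE is applicable. */
	if (map_deny_write_exec(vm_flags, vm_flags))
		return -EACCES;

	/* Allow architectures to sanity-check the vm_flags. */
	if (!arch_validate_flags(vm_flags))
		return -EINVAL;

so that a mapping which trips both checks still reports -EACCES as it did
before this patch.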

> +	/* Map writable and ensure this isn't a sealed memfd. */
> +	if (file && is_shared_maywrite(vm_flags)) {
> +		int error = mapping_map_writable(file->f_mapping);
> +
> +		if (error)
> +			return error;
> +		writable_file_mapping = true;
> +	}
> +
> +	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
> +
> +	/* Clear our write mapping regardless of error. */
> +	if (writable_file_mapping)
> +		mapping_unmap_writable(file->f_mapping);
> +
> +	validate_mm(current->mm);
> +	return ret;
> +}
> +
>  static int __vm_munmap(unsigned long start, size_t len, bool unlock)
>  {
>  	int ret;
> --
> 2.47.0



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour
  2024-10-23 12:58   ` Vlastimil Babka
@ 2024-10-23 14:20     ` Liam R. Howlett
  2024-10-23 17:11       ` Lorenzo Stoakes
  0 siblings, 1 reply; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 14:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Lorenzo Stoakes, Andrew Morton, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Vlastimil Babka <vbabka@suse.cz> [241023 08:59]:
> On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > The mmap_region() function is somewhat terrifying, with spaghetti-like
> > control flow and numerous means by which issues can arise and incomplete
> > state, memory leaks and other unpleasantness can occur.
> > 
> > A large amount of the complexity arises from trying to handle errors late
> > in the process of mapping a VMA, which forms the basis of recently observed
> > issues with resource leaks and observable inconsistent state.
> > 
> > Taking advantage of previous patches in this series we move a number of
> > checks earlier in the code, simplifying things by moving the core of the
> > logic into a static internal function __mmap_region().
> > 
> > Doing this allows us to perform a number of checks up front before we do
> > any real work, and allows us to unwind the writable unmap check
> > unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
> > validation unconditionally also.
> > 
> > We move a number of things here:
> > 
> > 1. We preallocate memory for the iterator before we call the file-backed
> >    memory hook, allowing us to exit early and avoid having to perform
> >    complicated and error-prone close/free logic. We carefully free
> >    iterator state on both success and error paths.
> > 
> > 2. The enclosing mmap_region() function handles the mapping_map_writable()
> >    logic early. Previously the logic had the mapping_map_writable() at the
> >    point of mapping a newly allocated file-backed VMA, and a matching
> >    mapping_unmap_writable() on success and error paths.
> > 
> >    We now do this unconditionally if this is a file-backed, shared writable
> >    mapping. If a driver changes the flags to eliminate VM_MAYWRITE, doing
> >    so does not invalidate the seal check we just performed, and in any case
> >    we always decrement the counter in the wrapper.
> > 
> >    We perform a debug assert to ensure a driver does not attempt to do the
> >    opposite.
> > 
> > 3. We also move arch_validate_flags() up into the mmap_region()
> >    function. This is only relevant on arm64 and sparc64, and the check is
> >    only meaningful for SPARC with ADI enabled. We explicitly add a warning
> >    for this arch if a driver invalidates this check, though the code ought
> >    eventually to be fixed to eliminate the need for this.
> > 
> > With all of these measures in place, we no longer need to explicitly close
> > the VMA on error paths, as we place all checks which might fail prior to a
> > call to any driver mmap hook.
> > 
> > This eliminates an entire class of errors, makes the code easier to reason
> > about and more robust.
> > 
> > Reported-by: Jann Horn <jannh@google.com>
> > Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> > Cc: stable <stable@kernel.org>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> some nits below
> 
> > ---
> >  mm/mmap.c | 120 ++++++++++++++++++++++++++++++------------------------
> >  1 file changed, 66 insertions(+), 54 deletions(-)
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 66edf0ebba94..7d02b47a1895 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1361,20 +1361,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> >  	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
> >  }
> > 
> > -unsigned long mmap_region(struct file *file, unsigned long addr,
> > +static unsigned long __mmap_region(struct file *file, unsigned long addr,
> >  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> >  		struct list_head *uf)
> >  {
> >  	struct mm_struct *mm = current->mm;
> >  	struct vm_area_struct *vma = NULL;
> >  	pgoff_t pglen = PHYS_PFN(len);
> > -	struct vm_area_struct *merge;
> >  	unsigned long charged = 0;
> >  	struct vma_munmap_struct vms;
> >  	struct ma_state mas_detach;
> >  	struct maple_tree mt_detach;
> >  	unsigned long end = addr + len;
> > -	bool writable_file_mapping = false;
> >  	int error;
> >  	VMA_ITERATOR(vmi, mm, addr);
> >  	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > @@ -1448,28 +1446,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  	vm_flags_init(vma, vm_flags);
> >  	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > 
> > +	if (vma_iter_prealloc(&vmi, vma)) {
> > +		error = -ENOMEM;
> > +		goto free_vma;
> > +	}
> > +
> >  	if (file) {
> >  		vma->vm_file = get_file(file);
> >  		error = mmap_file(file, vma);
> >  		if (error)
> > -			goto unmap_and_free_vma;
> > -
> > -		if (vma_is_shared_maywrite(vma)) {
> > -			error = mapping_map_writable(file->f_mapping);
> > -			if (error)
> > -				goto close_and_free_vma;
> > -
> > -			writable_file_mapping = true;
> > -		}
> > +			goto unmap_and_free_file_vma;
> > 
> > +		/* Drivers cannot alter the address of the VMA. */
> > +		WARN_ON_ONCE(addr != vma->vm_start);
> >  		/*
> > -		 * Expansion is handled above, merging is handled below.
> > -		 * Drivers should not alter the address of the VMA.
> > +		 * Drivers should not permit writability when previously it was
> > +		 * disallowed.
> >  		 */
> > -		if (WARN_ON((addr != vma->vm_start))) {
> > -			error = -EINVAL;
> > -			goto close_and_free_vma;
> > -		}
> > +		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> > +				!(vm_flags & VM_MAYWRITE) &&
> > +				(vma->vm_flags & VM_MAYWRITE));
> > 
> >  		vma_iter_config(&vmi, addr, end);
> 
> I wonder if this one could be removed, since earlier above we did the same
> config and neither parameter changed? But it was true before this patch as
> well, and maybe it's further refactored away later in the series; just noting.

Yes, this was here in case the vma changed address, so it's probably not
necessary.

> 
> >  		/*
> > @@ -1477,6 +1473,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  		 * vma again as we may succeed this time.
> >  		 */
> >  		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> > +			struct vm_area_struct *merge;
> > +
> >  			vmg.flags = vma->vm_flags;
> >  			/* If this fails, state is reset ready for a reattempt. */
> >  			merge = vma_merge_new_range(&vmg);
> > @@ -1491,10 +1489,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  				 */
> >  				fput(vma->vm_file);
> >  				vm_area_free(vma);
> > +				vma_iter_free(&vmi);
> 
> If we merged successfully, I think this is not necessary? But doesn't hurt?

Yes, it will use the allocations (and re-allocate more if necessary)
then free the unused allocations once this call path reaches
commit_merge() with the same vmi, which is nice.

And yes, it is safe to do regardless.

To be honest, this whole block covers such a rare case that I want to delete
it anyway.

> 
> >  				vma = merge;
> >  				/* Update vm_flags to pick up the change. */
> >  				vm_flags = vma->vm_flags;
> > -				goto unmap_writable;
> > +				goto file_expanded;
> >  			}
> >  			vma_iter_config(&vmi, addr, end);
> >  		}
> > @@ -1503,26 +1502,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  	} else if (vm_flags & VM_SHARED) {
> >  		error = shmem_zero_setup(vma);
> >  		if (error)
> > -			goto free_vma;
> > +			goto free_iter_vma;
> >  	} else {
> >  		vma_set_anonymous(vma);
> >  	}
> > 
> > -	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
> > -		error = -EACCES;
> > -		goto close_and_free_vma;
> > -	}
> > -
> > -	/* Allow architectures to sanity-check the vm_flags */
> > -	if (!arch_validate_flags(vma->vm_flags)) {
> > -		error = -EINVAL;
> > -		goto close_and_free_vma;
> > -	}
> > -
> > -	if (vma_iter_prealloc(&vmi, vma)) {
> > -		error = -ENOMEM;
> > -		goto close_and_free_vma;
> > -	}
> > +#ifdef CONFIG_SPARC64
> > +	/* TODO: Fix SPARC ADI! */
> > +	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> > +#endif
> > 
> >  	/* Lock the VMA since it is modified after insertion into VMA tree */
> >  	vma_start_write(vma);
> > @@ -1536,10 +1524,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  	 */
> >  	khugepaged_enter_vma(vma, vma->vm_flags);
> > 
> > -	/* Once vma denies write, undo our temporary denial count */
> > -unmap_writable:
> > -	if (writable_file_mapping)
> > -		mapping_unmap_writable(file->f_mapping);
> > +file_expanded:
> >  	file = vma->vm_file;
> >  	ksm_add_vma(vma);
> >  expanded:
> > @@ -1572,23 +1557,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > 
> >  	vma_set_page_prot(vma);
> > 
> > -	validate_mm(mm);
> >  	return addr;
> > 
> > -close_and_free_vma:
> > -	vma_close(vma);
> > -
> > -	if (file || vma->vm_file) {
> > -unmap_and_free_vma:
> > -		fput(vma->vm_file);
> > -		vma->vm_file = NULL;
> > +unmap_and_free_file_vma:
> > +	fput(vma->vm_file);
> > +	vma->vm_file = NULL;
> > 
> > -		vma_iter_set(&vmi, vma->vm_end);
> > -		/* Undo any partial mapping done by a device driver. */
> > -		unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > -	}
> > -	if (writable_file_mapping)
> > -		mapping_unmap_writable(file->f_mapping);
> > +	vma_iter_set(&vmi, vma->vm_end);
> > +	/* Undo any partial mapping done by a device driver. */
> > +	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > +free_iter_vma:
> > +	vma_iter_free(&vmi);
> >  free_vma:
> >  	vm_area_free(vma);
> >  unacct_error:
> > @@ -1598,10 +1577,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  abort_munmap:
> >  	vms_abort_munmap_vmas(&vms, &mas_detach);
> >  gather_failed:
> > -	validate_mm(mm);
> >  	return error;
> >  }
> > 
> > +unsigned long mmap_region(struct file *file, unsigned long addr,
> > +			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > +			  struct list_head *uf)
> > +{
> > +	unsigned long ret;
> > +	bool writable_file_mapping = false;
> > +
> > +	/* Allow architectures to sanity-check the vm_flags. */
> > +	if (!arch_validate_flags(vm_flags))
> > +		return -EINVAL;
> > +
> > +	/* Check to see if MDWE is applicable. */
> > +	if (map_deny_write_exec(vm_flags, vm_flags))
> > +		return -EACCES;
> 
> The two checks above used to be in the opposite order. Can we keep that just
> to be sure we don't change user observable behavior unnecessarily?
> 
> > +	/* Map writable and ensure this isn't a sealed memfd. */
> > +	if (file && is_shared_maywrite(vm_flags)) {
> > +		int error = mapping_map_writable(file->f_mapping);
> > +
> > +		if (error)
> > +			return error;
> > +		writable_file_mapping = true;
> > +	}
> > +
> > +	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
> > +
> > +	/* Clear our write mapping regardless of error. */
> > +	if (writable_file_mapping)
> > +		mapping_unmap_writable(file->f_mapping);
> > +
> > +	validate_mm(current->mm);
> > +	return ret;
> > +}
> > +
> >  static int __vm_munmap(unsigned long start, size_t len, bool unlock)
> >  {
> >  	int ret;
> > --
> > 2.47.0
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  2024-10-22 20:40 ` [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
  2024-10-22 21:14   ` Jann Horn
  2024-10-23  9:11   ` Vlastimil Babka
@ 2024-10-23 14:22   ` Liam R. Howlett
  2 siblings, 0 replies; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 14:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241022 16:41]:
> After an attempted mmap() fails, we are no longer in a situation where we
> can safely interact with VMA hooks. This is currently not enforced, meaning
> that we need complicated handling to ensure we do not incorrectly call
> these hooks.
> 
> We can avoid the whole issue by treating the VMA as suspect the moment that
> the file->f_ops->mmap() function reports an error by replacing whatever VMA
> operations were installed with a dummy empty set of VMA operations.
> 
> We do so through a new helper function internal to mm - mmap_file() - which
> is both more logically named than the existing call_mmap() function and
> correctly isolates handling of the vm_op reassignment to mm.
> 
> All the existing invocations of call_mmap() outside of mm are ultimately
> nested within the call_mmap() from mm, which we now replace.
> 
> It is therefore safe to leave call_mmap() in place as a convenience
> function (and to avoid churn). The invokers are:
> 
>      ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
>     coda_file_operations -> mmap -> coda_file_mmap()
>      shm_file_operations -> shm_mmap()
> shm_file_operations_huge -> shm_mmap()
>             dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
> 	                    -> i915_gem_dmabuf_mmap()
> 
> None of these callers interact with vm_ops or mappings in a problematic way
> on error; they simply exit quickly.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  mm/internal.h | 27 +++++++++++++++++++++++++++
>  mm/mmap.c     |  6 +++---
>  mm/nommu.c    |  4 ++--
>  3 files changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 508f7802dd2b..af032e76dfd4 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -108,6 +108,33 @@ static inline void *folio_raw_mapping(const struct folio *folio)
>  	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>  }
> 
> +/*
> + * This is a file-backed mapping, and is about to be memory mapped - invoke its
> + * mmap hook and safely handle error conditions. On error, VMA hooks will be
> + * mutated.
> + *
> + * @file: File which backs the mapping.
> + * @vma:  VMA which we are mapping.
> + *
> + * Returns: 0 if success, error otherwise.
> + */
> +static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
> +{
> +	int err = call_mmap(file, vma);
> +
> +	if (likely(!err))
> +		return 0;
> +
> +	/*
> +	 * OK, we tried to call the file hook for mmap(), but an error
> +	 * arose. The mapping is in an inconsistent state and we must not invoke
> +	 * any further hooks on it.
> +	 */
> +	vma->vm_ops = &vma_dummy_vm_ops;
> +
> +	return err;
> +}
> +
>  #ifdef CONFIG_MMU
> 
>  /* Flags for folio_pte_batch(). */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 1ba0878bbc30..10f4ccaf491b 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1425,7 +1425,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	/*
>  	 * clear PTEs while the vma is still in the tree so that rmap
>  	 * cannot race with the freeing later in the truncate scenario.
> -	 * This is also needed for call_mmap(), which is why vm_ops
> +	 * This is also needed for mmap_file(), which is why vm_ops
>  	 * close function is called.
>  	 */
>  	vms_clean_up_area(&vms, &mas_detach);
> @@ -1450,7 +1450,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> 
>  	if (file) {
>  		vma->vm_file = get_file(file);
> -		error = call_mmap(file, vma);
> +		error = mmap_file(file, vma);
>  		if (error)
>  			goto unmap_and_free_vma;
> 
> @@ -1473,7 +1473,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> 
>  		vma_iter_config(&vmi, addr, end);
>  		/*
> -		 * If vm_flags changed after call_mmap(), we should try merge
> +		 * If vm_flags changed after mmap_file(), we should try merge
>  		 * vma again as we may succeed this time.
>  		 */
>  		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 385b0c15add8..f9ccc02458ec 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -885,7 +885,7 @@ static int do_mmap_shared_file(struct vm_area_struct *vma)
>  {
>  	int ret;
> 
> -	ret = call_mmap(vma->vm_file, vma);
> +	ret = mmap_file(vma->vm_file, vma);
>  	if (ret == 0) {
>  		vma->vm_region->vm_top = vma->vm_region->vm_end;
>  		return 0;
> @@ -918,7 +918,7 @@ static int do_mmap_private(struct vm_area_struct *vma,
>  	 * happy.
>  	 */
>  	if (capabilities & NOMMU_MAP_DIRECT) {
> -		ret = call_mmap(vma->vm_file, vma);
> +		ret = mmap_file(vma->vm_file, vma);
>  		/* shouldn't return success if we're not sharing */
>  		if (WARN_ON_ONCE(!is_nommu_shared_mapping(vma->vm_flags)))
>  			ret = -ENOSYS;
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error
  2024-10-22 20:40 ` [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
  2024-10-22 21:15   ` Jann Horn
  2024-10-23  9:24   ` Vlastimil Babka
@ 2024-10-23 14:26   ` Liam R. Howlett
  2024-10-23 14:41   ` Liam R. Howlett
  3 siblings, 0 replies; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 14:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241022 16:41]:
> Incorrect invocation of VMA callbacks when the VMA is no longer in a
> consistent state is bug prone and risky to perform.
> 
> With regards to the important vm_ops->close() callback, we have gone to
> great lengths to try to track whether or not we ought to close VMAs.
> 
> Rather than doing so and risking making a mistake somewhere, instead
> unconditionally close and reset vma->vm_ops to an empty dummy operations
> set with a NULL .close operator.
> 
> We introduce a new function to do so - vma_close() - and simplify existing
> vms logic which tracked whether we needed to close or not.
> 
> This simplifies the logic, avoids incorrect double-calling of the .close()
> callback and allows us to update error paths to simply call vma_close()
> unconditionally - making VMA closure idempotent.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  mm/internal.h | 17 +++++++++++++++++
>  mm/mmap.c     |  5 ++---
>  mm/nommu.c    |  3 +--
>  mm/vma.c      | 14 +++++---------
>  mm/vma.h      |  4 +---
>  5 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index af032e76dfd4..3a45cc592fd0 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -135,6 +135,23 @@ static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
>  	return err;
>  }
> 
> +/*
> + * Unconditionally close the VMA if it has a close hook and prevent hooks from
> + * being invoked after close. VMA hooks are mutated.
> + */
> +static inline void vma_close(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_ops && vma->vm_ops->close) {
> +		vma->vm_ops->close(vma);
> +
> +		/*
> +		 * The mapping is in an inconsistent state, and no further hooks
> +		 * may be invoked upon it.
> +		 */
> +		vma->vm_ops = &vma_dummy_vm_ops;
> +	}
> +}
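
The hook-neutering trick is pleasingly self-contained, by the way. Purely as
an aside, a throwaway userspace model (not kernel code) of why a second close
becomes a no-op:

	#include <stdio.h>

	struct ops { void (*close)(void); };

	static void real_close(void) { puts("close: releasing resources"); }

	static const struct ops real_ops  = { .close = real_close };
	static const struct ops dummy_ops = { 0 };	/* NULL .close */

	struct object { const struct ops *ops; };

	static void object_close(struct object *obj)
	{
		if (obj->ops && obj->ops->close) {
			obj->ops->close();
			/* No further hooks may be invoked on this object. */
			obj->ops = &dummy_ops;
		}
	}

	int main(void)
	{
		struct object obj = { .ops = &real_ops };

		object_close(&obj);	/* runs real_close() once */
		object_close(&obj);	/* now a no-op - closure is idempotent */
		return 0;
	}
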
> +
>  #ifdef CONFIG_MMU
> 
>  /* Flags for folio_pte_batch(). */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 10f4ccaf491b..d55c58e99a54 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1576,8 +1576,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	return addr;
> 
>  close_and_free_vma:
> -	if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
> -		vma->vm_ops->close(vma);
> +	vma_close(vma);
> 
>  	if (file || vma->vm_file) {
>  unmap_and_free_vma:
> @@ -1937,7 +1936,7 @@ void exit_mmap(struct mm_struct *mm)
>  	do {
>  		if (vma->vm_flags & VM_ACCOUNT)
>  			nr_accounted += vma_pages(vma);
> -		remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
> +		remove_vma(vma, /* unreachable = */ true);
>  		count++;
>  		cond_resched();
>  		vma = vma_next(&vmi);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index f9ccc02458ec..635d028d647b 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -589,8 +589,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
>   */
>  static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
>  {
> -	if (vma->vm_ops && vma->vm_ops->close)
> -		vma->vm_ops->close(vma);
> +	vma_close(vma);
>  	if (vma->vm_file)
>  		fput(vma->vm_file);
>  	put_nommu_region(vma->vm_region);
> diff --git a/mm/vma.c b/mm/vma.c
> index 3c5a80876725..bb7cfa2dc282 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -323,11 +323,10 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
>  /*
>   * Close a vm structure and free it.
>   */
> -void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
> +void remove_vma(struct vm_area_struct *vma, bool unreachable)
>  {
>  	might_sleep();
> -	if (!closed && vma->vm_ops && vma->vm_ops->close)
> -		vma->vm_ops->close(vma);
> +	vma_close(vma);
>  	if (vma->vm_file)
>  		fput(vma->vm_file);
>  	mpol_put(vma_policy(vma));
> @@ -1115,9 +1114,7 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
>  	vms_clear_ptes(vms, mas_detach, true);
>  	mas_set(mas_detach, 0);
>  	mas_for_each(mas_detach, vma, ULONG_MAX)
> -		if (vma->vm_ops && vma->vm_ops->close)
> -			vma->vm_ops->close(vma);
> -	vms->closed_vm_ops = true;
> +		vma_close(vma);
>  }
> 
>  /*
> @@ -1160,7 +1157,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
>  	/* Remove and clean up vmas */
>  	mas_set(mas_detach, 0);
>  	mas_for_each(mas_detach, vma, ULONG_MAX)
> -		remove_vma(vma, /* = */ false, vms->closed_vm_ops);
> +		remove_vma(vma, /* unreachable = */ false);
> 
>  	vm_unacct_memory(vms->nr_accounted);
>  	validate_mm(mm);
> @@ -1684,8 +1681,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  	return new_vma;
> 
>  out_vma_link:
> -	if (new_vma->vm_ops && new_vma->vm_ops->close)
> -		new_vma->vm_ops->close(new_vma);
> +	vma_close(new_vma);
> 
>  	if (new_vma->vm_file)
>  		fput(new_vma->vm_file);
> diff --git a/mm/vma.h b/mm/vma.h
> index 55457cb68200..75558b5e9c8c 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -42,7 +42,6 @@ struct vma_munmap_struct {
>  	int vma_count;                  /* Number of vmas that will be removed */
>  	bool unlock;                    /* Unlock after the munmap */
>  	bool clear_ptes;                /* If there are outstanding PTE to be cleared */
> -	bool closed_vm_ops;		/* call_mmap() was encountered, so vmas may be closed */
>  	/* 1 byte hole */
>  	unsigned long nr_pages;         /* Number of pages being removed */
>  	unsigned long locked_vm;        /* Number of locked pages */
> @@ -198,7 +197,6 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
>  	vms->unmap_start = FIRST_USER_ADDRESS;
>  	vms->unmap_end = USER_PGTABLES_CEILING;
>  	vms->clear_ptes = false;
> -	vms->closed_vm_ops = false;
>  }
>  #endif
> 
> @@ -269,7 +267,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  		  unsigned long start, size_t len, struct list_head *uf,
>  		  bool unlock);
> 
> -void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
> +void remove_vma(struct vm_area_struct *vma, bool unreachable);
> 
>  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
>  		struct vm_area_struct *prev, struct vm_area_struct *next);
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec()
  2024-10-22 20:40 ` [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
  2024-10-22 21:15   ` Jann Horn
  2024-10-23  9:57   ` Vlastimil Babka
@ 2024-10-23 14:30   ` Liam R. Howlett
  2024-10-23 16:25     ` Linus Torvalds
  2024-10-23 16:37     ` Lorenzo Stoakes
  2 siblings, 2 replies; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 14:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241022 16:41]:
> Refactor map_deny_write_exec() so that it no longer unnecessarily requires a
> VMA parameter but rather accepts VMA flags parameters, which allows us to use
> this function early in mmap_region() in a subsequent commit.
> 
> While we're here, we refactor the function to be more readable and add some
> additional documentation.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  include/linux/mman.h | 21 ++++++++++++++++++---
>  mm/mmap.c            |  2 +-
>  mm/mprotect.c        |  2 +-
>  3 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index bcb201ab7a41..d71f83b821ef 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -188,16 +188,31 @@ static inline bool arch_memory_deny_write_exec_supported(void)
>   *
>   *	d)	mmap(PROT_READ | PROT_EXEC)
>   *		mmap(PROT_READ | PROT_EXEC | PROT_BTI)

nit: the arguments are not in the comments (which is a pre-existing
issue).

> + *
> + * This is only applicable if the user has set the Memory-Deny-Write-Execute
> + * (MDWE) protection mask for the current process.
> + *
> + * @old specifies the VMA flags the VMA originally possessed, and @new the ones
> + * we propose to set.
> + *
> + * Return: false if proposed change is OK, true if not ok and should be denied.
>   */
> -static inline bool map_deny_write_exec(struct vm_area_struct *vma,  unsigned long vm_flags)
> +static inline bool map_deny_write_exec(unsigned long old, unsigned long new)

Nit: somehow "new" gets highlighted as a keyword in my editor.  Also, I'm too
lazy to fix my editor, so I'll live with it.

>  {
> +	/* If MDWE is disabled, we have nothing to deny. */
>  	if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
>  		return false;
> 
> -	if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
> +	/* If the new VMA is not executable, we have nothing to deny. */
> +	if (!(new & VM_EXEC))
> +		return false;
> +
> +	/* Under MDWE we absolutely do not accept writably executable... */
> +	if (new & VM_WRITE)
>  		return true;
> 
> -	if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
> +	/* ...nor newly executable VMAs. */
> +	if (!(old & VM_EXEC))
>  		return true;
> 
>  	return false;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d55c58e99a54..66edf0ebba94 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1508,7 +1508,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  		vma_set_anonymous(vma);
>  	}
> 
> -	if (map_deny_write_exec(vma, vma->vm_flags)) {
> +	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
>  		error = -EACCES;
>  		goto close_and_free_vma;
>  	}
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 0c5d6d06107d..6f450af3252e 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -810,7 +810,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
>  			break;
>  		}
> 
> -		if (map_deny_write_exec(vma, newflags)) {
> +		if (map_deny_write_exec(vma->vm_flags, newflags)) {
>  			error = -EACCES;
>  			break;
>  		}
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region()
  2024-10-22 20:40 ` [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
@ 2024-10-23 14:38   ` Vlastimil Babka
  2024-10-23 15:21     ` Liam R. Howlett
  2024-10-23 17:30     ` Lorenzo Stoakes
  2024-10-23 17:19   ` Liam R. Howlett
  1 sibling, 2 replies; 46+ messages in thread
From: Vlastimil Babka @ 2024-10-23 14:38 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
	Linus Torvalds, Peter Xu

On 10/22/24 22:40, Lorenzo Stoakes wrote:
> We have seen bugs and resource leaks arise from the complexity of the
> __mmap_region() function. This, together with its deeply fragile error
> handling logic and a level of complexity which makes the function difficult
> to understand, makes it highly desirable to refactor it into something
> readable.
> 
> Achieve this by separating the function into smaller logical parts which
> are easier to understand and follow, and which importantly very
> significantly simplify the error handling.
> 
> Note that we now call vms_abort_munmap_vmas() in more error paths than we
> used to; however, in cases where no abort needs to occur, vms->nr_pages will
> be equal to zero and we simply exit this function without doing more than we
> would have done previously.
> 
> Importantly, the invocation of the driver mmap hook via mmap_file() now has
> very simple and obvious handling (this was previously the most problematic
> part of the mmap() operation).
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  mm/vma.c | 380 +++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 240 insertions(+), 140 deletions(-)
> 
> diff --git a/mm/vma.c b/mm/vma.c
> index 7617f9d50d62..a271e2b406ab 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -7,6 +7,31 @@
>  #include "vma_internal.h"
>  #include "vma.h"
> 
> +struct mmap_state {
> +	struct mm_struct *mm;
> +	struct vma_iterator *vmi;
> +	struct vma_merge_struct *vmg;
> +	struct list_head *uf;
> +
> +	struct vma_munmap_struct vms;
> +	struct ma_state mas_detach;
> +	struct maple_tree mt_detach;
> +
> +	unsigned long flags;
> +	unsigned long pglen;
> +	unsigned long charged;
> +};
> +
> +#define MMAP_STATE(name, mm_, vmi_, vmg_, uf_, flags_, len_)	\
> +	struct mmap_state name = {				\
> +		.mm = mm_,					\
> +		.vmi = vmi_,					\
> +		.vmg = vmg_,					\
> +		.uf = uf_,					\
> +		.flags = flags_,				\
> +		.pglen = PHYS_PFN(len_),			\
> +	}
> +
>  static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
>  {
>  	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
> @@ -2169,189 +2194,247 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
>  	vms_complete_munmap_vmas(vms, mas_detach);
>  }
> 
> -unsigned long __mmap_region(struct file *file, unsigned long addr,
> -		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> -		struct list_head *uf)
> +/*
> + * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
> + *                    unmapped once the map operation is completed, check limits,
> + *                    account mapping and clean up any pre-existing VMAs.
> + *
> + * @map: Mapping state.
> + *
> + * Returns: 0 on success, error code otherwise.
> + */
> +static int __mmap_prepare(struct mmap_state *map)
>  {
> -	struct mm_struct *mm = current->mm;
> -	struct vm_area_struct *vma = NULL;
> -	pgoff_t pglen = PHYS_PFN(len);
> -	unsigned long charged = 0;
> -	struct vma_munmap_struct vms;
> -	struct ma_state mas_detach;
> -	struct maple_tree mt_detach;
> -	unsigned long end = addr + len;
>  	int error;
> -	VMA_ITERATOR(vmi, mm, addr);
> -	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> -
> -	vmg.file = file;
> -	/* Find the first overlapping VMA */
> -	vma = vma_find(&vmi, end);
> -	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> -	if (vma) {
> -		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> -		mt_on_stack(mt_detach);
> -		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> +	struct vma_iterator *vmi = map->vmi;
> +	struct vma_merge_struct *vmg = map->vmg;
> +	struct vma_munmap_struct *vms = &map->vms;
> +
> +	/* Find the first overlapping VMA and initialise unmap state. */
> +	vms->vma = vma_find(vmi, vmg->end);
> +	init_vma_munmap(vms, vmi, vms->vma, vmg->start, vmg->end, map->uf,
> +			/* unlock = */ false);
> +
> +	/* OK, we have overlapping VMAs - prepare to unmap them. */
> +	if (vms->vma) {
> +		mt_init_flags(&map->mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> +		mt_on_stack(map->mt_detach);
> +		mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
>  		/* Prepare to unmap any existing mapping in the area */
> -		error = vms_gather_munmap_vmas(&vms, &mas_detach);
> +		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
>  		if (error)
> -			goto gather_failed;
> +			return error;

So this assumes vms_abort_munmap_vmas() will rely on the "vms->nr_pages will
be equal to zero" behaviour mentioned in the commit log. But AFAICS
vms_gather_munmap_vmas() can fail in the Nth iteration of its
for_each_vma_range() loop, after earlier iterations have already increased
nr_pages; it then does a reattach_vmas(), returns the error, and we simply
return that error here.
I think either here, or maybe in vms_gather_munmap_vmas() itself, vms->nr_pages
should be reset to zero on error so that vms_abort_munmap_vmas() really is a
no-op?
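
Something like this, perhaps (untested):

	error = vms_gather_munmap_vmas(vms, &map->mas_detach);
	if (error) {
		/* Ensure the later vms_abort_munmap_vmas() is a no-op. */
		vms->nr_pages = 0;
		return error;
	}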

> 
> -		vmg.next = vms.next;
> -		vmg.prev = vms.prev;
> -		vma = NULL;
> +		vmg->next = vms->next;
> +		vmg->prev = vms->prev;
>  	} else {
> -		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
> +		vmg->next = vma_iter_next_rewind(vmi, &vmg->prev);
>  	}
> 
>  	/* Check against address space limit. */
> -	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
> -		error = -ENOMEM;
> -		goto abort_munmap;
> -	}
> +	if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
> +		return -ENOMEM;
> 
> -	/*
> -	 * Private writable mapping: check memory availability
> -	 */
> -	if (accountable_mapping(file, vm_flags)) {
> -		charged = pglen;
> -		charged -= vms.nr_accounted;
> -		if (charged) {
> -			error = security_vm_enough_memory_mm(mm, charged);
> +	/* Private writable mapping: check memory availability. */
> +	if (accountable_mapping(vmg->file, map->flags)) {
> +		map->charged = map->pglen;
> +		map->charged -= vms->nr_accounted;
> +		if (map->charged) {
> +			error = security_vm_enough_memory_mm(map->mm, map->charged);
>  			if (error)
> -				goto abort_munmap;
> +				return error;
>  		}
> 
> -		vms.nr_accounted = 0;
> -		vm_flags |= VM_ACCOUNT;
> -		vmg.flags = vm_flags;
> +		vms->nr_accounted = 0;
> +		map->flags |= VM_ACCOUNT;
>  	}
> 
>  	/*
> -	 * clear PTEs while the vma is still in the tree so that rmap
> +	 * Clear PTEs while the vma is still in the tree so that rmap
>  	 * cannot race with the freeing later in the truncate scenario.
>  	 * This is also needed for mmap_file(), which is why vm_ops
>  	 * close function is called.
>  	 */
> -	vms_clean_up_area(&vms, &mas_detach);
> -	vma = vma_merge_new_range(&vmg);
> -	if (vma)
> -		goto expanded;
> +	vms_clean_up_area(vms, &map->mas_detach);
> +
> +	return 0;
> +}
> +
> +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> +			       struct vm_area_struct **mergep)
> +{
> +	struct vma_iterator *vmi = map->vmi;
> +	struct vma_merge_struct *vmg = map->vmg;
> +	int error;
> +
> +	vma->vm_file = get_file(vmg->file);
> +	error = mmap_file(vma->vm_file, vma);
> +	if (error) {
> +		fput(vma->vm_file);
> +		vma->vm_file = NULL;
> +
> +		vma_iter_set(vmi, vma->vm_end);
> +		/* Undo any partial mapping done by a device driver. */
> +		unmap_region(&vmi->mas, vma, vmg->prev, vmg->next);
> +
> +		return error;
> +	}
> +
> +	/* Drivers cannot alter the address of the VMA. */
> +	WARN_ON_ONCE(vmg->start != vma->vm_start);
> +	/*
> +	 * Drivers should not permit writability when previously it was
> +	 * disallowed.
> +	 */
> +	VM_WARN_ON_ONCE(map->flags != vma->vm_flags &&
> +			!(map->flags & VM_MAYWRITE) &&
> +			(vma->vm_flags & VM_MAYWRITE));
> +
> +	vma_iter_config(vmi, vmg->start, vmg->end);
> +	/*
> +	 * If flags changed after mmap_file(), we should try merge
> +	 * vma again as we may succeed this time.
> +	 */
> +	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> +		struct vm_area_struct *merge;
> +
> +		vmg->flags = vma->vm_flags;
> +		/* If this fails, state is reset ready for a reattempt. */
> +		merge = vma_merge_new_range(vmg);
> +
> +		if (merge) {
> +			/*
> +			 * ->mmap() can change vma->vm_file and fput
> +			 * the original file. So fput the vma->vm_file
> +			 * here or we would add an extra fput for file
> +			 * and cause general protection fault
> +			 * ultimately.
> +			 */
> +			fput(vma->vm_file);
> +			vm_area_free(vma);

This frees the vma.

> +			vma_iter_free(vmi);
> +			*mergep = merge;
> +		} else {
> +			vma_iter_config(vmi, vmg->start, vmg->end);
> +		}
> +	}
> +
> +	map->flags = vma->vm_flags;

So this is use-after-free.

Maybe pass only a single struct vm_area_struct **vmap parameter to this
function, and in case of merge, change both vma and *vmap to it?

Although I can see it's all moot after 8/8. Still let's not introduce a
temporary UAF step.
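
FWIW, with a single **vmap parameter (and vma = *vmap taken at the top) the
tail of the helper would look roughly like this - untested, just to sketch the
shape:

	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
		struct vm_area_struct *merge;

		vmg->flags = vma->vm_flags;
		/* If this fails, state is reset ready for a reattempt. */
		merge = vma_merge_new_range(vmg);

		if (merge) {
			fput(vma->vm_file);
			vm_area_free(vma);
			vma_iter_free(vmi);
			/*
			 * Point both the local pointer and the caller's at the
			 * merged VMA so nothing dereferences the freed one.
			 */
			vma = *vmap = merge;
		} else {
			vma_iter_config(vmi, vmg->start, vmg->end);
		}
	}

	map->flags = vma->vm_flags;
	return 0;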

> +	return 0;
> +}
> +
> +/*
> + * __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
> + *                    possible.
> + *
> + *                    An exception to this is if the mapping is file-backed, and
> + *                    the underlying driver changes the VMA flags, permitting a
> + *                    subsequent merge of the VMA, in which case the returned
> + *                    VMA is one that was merged on a second attempt.
> + *
> + * @map:  Mapping state.
> + * @vmap: Output pointer for the new VMA.
> + *
> + * Returns: Zero on success, or an error.
> + */
> +static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> +{
> +	struct vma_iterator *vmi = map->vmi;
> +	struct vma_merge_struct *vmg = map->vmg;
> +	struct vm_area_struct *merge = NULL;
> +	int error = 0;
> +	struct vm_area_struct *vma;
> +
>  	/*
>  	 * Determine the object being mapped and call the appropriate
>  	 * specific mapper. the address has already been validated, but
>  	 * not unmapped, but the maps are removed from the list.
>  	 */
> -	vma = vm_area_alloc(mm);
> -	if (!vma) {
> -		error = -ENOMEM;
> -		goto unacct_error;
> -	}
> +	vma = vm_area_alloc(map->mm);
> +	if (!vma)
> +		return -ENOMEM;
> 
> -	vma_iter_config(&vmi, addr, end);
> -	vma_set_range(vma, addr, end, pgoff);
> -	vm_flags_init(vma, vm_flags);
> -	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> +	vma_iter_config(vmi, vmg->start, vmg->end);
> +	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
> +	vm_flags_init(vma, map->flags);
> +	vma->vm_page_prot = vm_get_page_prot(map->flags);
> 
> -	if (vma_iter_prealloc(&vmi, vma)) {
> +	if (vma_iter_prealloc(vmi, vma)) {
>  		error = -ENOMEM;
>  		goto free_vma;
>  	}
> 
> -	if (file) {
> -		vma->vm_file = get_file(file);
> -		error = mmap_file(file, vma);
> -		if (error)
> -			goto unmap_and_free_file_vma;
> -
> -		/* Drivers cannot alter the address of the VMA. */
> -		WARN_ON_ONCE(addr != vma->vm_start);
> -		/*
> -		 * Drivers should not permit writability when previously it was
> -		 * disallowed.
> -		 */
> -		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> -				!(vm_flags & VM_MAYWRITE) &&
> -				(vma->vm_flags & VM_MAYWRITE));
> -
> -		vma_iter_config(&vmi, addr, end);
> -		/*
> -		 * If vm_flags changed after mmap_file(), we should try merge
> -		 * vma again as we may succeed this time.
> -		 */
> -		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> -			struct vm_area_struct *merge;
> -
> -			vmg.flags = vma->vm_flags;
> -			/* If this fails, state is reset ready for a reattempt. */
> -			merge = vma_merge_new_range(&vmg);
> -
> -			if (merge) {
> -				/*
> -				 * ->mmap() can change vma->vm_file and fput
> -				 * the original file. So fput the vma->vm_file
> -				 * here or we would add an extra fput for file
> -				 * and cause general protection fault
> -				 * ultimately.
> -				 */
> -				fput(vma->vm_file);
> -				vm_area_free(vma);
> -				vma_iter_free(&vmi);
> -				vma = merge;
> -				/* Update vm_flags to pick up the change. */
> -				vm_flags = vma->vm_flags;
> -				goto file_expanded;
> -			}
> -			vma_iter_config(&vmi, addr, end);
> -		}
> -
> -		vm_flags = vma->vm_flags;
> -	} else if (vm_flags & VM_SHARED) {
> +	if (vmg->file)
> +		error = __mmap_new_file_vma(map, vma, &merge);
> +	else if (map->flags & VM_SHARED)
>  		error = shmem_zero_setup(vma);
> -		if (error)
> -			goto free_iter_vma;
> -	} else {
> +	else
>  		vma_set_anonymous(vma);
> -	}
> +
> +	if (error)
> +		goto free_iter_vma;
> +
> +	if (merge)
> +		goto file_expanded;
> 
>  #ifdef CONFIG_SPARC64
>  	/* TODO: Fix SPARC ADI! */
> -	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> +	WARN_ON_ONCE(!arch_validate_flags(map->flags));
>  #endif
> 
>  	/* Lock the VMA since it is modified after insertion into VMA tree */
>  	vma_start_write(vma);
> -	vma_iter_store(&vmi, vma);
> -	mm->map_count++;
> +	vma_iter_store(vmi, vma);
> +	map->mm->map_count++;
>  	vma_link_file(vma);
> 
>  	/*
>  	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
>  	 * call covers the non-merge case.
>  	 */
> -	khugepaged_enter_vma(vma, vma->vm_flags);
> +	khugepaged_enter_vma(vma, map->flags);
> 
>  file_expanded:
> -	file = vma->vm_file;
>  	ksm_add_vma(vma);
> -expanded:
> +
> +	*vmap = vma;
> +	return 0;
> +
> +free_iter_vma:
> +	vma_iter_free(vmi);
> +free_vma:
> +	vm_area_free(vma);
> +	return error;
> +}
> +
> +/*
> + * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
> + *                     statistics, handle locking and finalise the VMA.
> + *
> + * @map: Mapping state.
> + * @vma: Merged or newly allocated VMA for the mmap()'d region.
> + */
> +static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = map->mm;
> +	unsigned long vm_flags = vma->vm_flags;
> +
>  	perf_event_mmap(vma);
> 
> -	/* Unmap any existing mapping in the area */
> -	vms_complete_munmap_vmas(&vms, &mas_detach);
> +	/* Unmap any existing mapping in the area. */
> +	vms_complete_munmap_vmas(&map->vms, &map->mas_detach);
> 
> -	vm_stat_account(mm, vm_flags, pglen);
> +	vm_stat_account(mm, vma->vm_flags, map->pglen);
>  	if (vm_flags & VM_LOCKED) {
>  		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
>  					is_vm_hugetlb_page(vma) ||
> -					vma == get_gate_vma(current->mm))
> +					vma == get_gate_vma(mm))
>  			vm_flags_clear(vma, VM_LOCKED_MASK);
>  		else
> -			mm->locked_vm += pglen;
> +			mm->locked_vm += map->pglen;
>  	}
> 
> -	if (file)
> +	if (vma->vm_file)
>  		uprobe_mmap(vma);
> 
>  	/*
> @@ -2364,26 +2447,43 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
>  	vm_flags_set(vma, VM_SOFTDIRTY);
> 
>  	vma_set_page_prot(vma);
> +}
> 
> -	return addr;
> +unsigned long __mmap_region(struct file *file, unsigned long addr,
> +		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> +		struct list_head *uf)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma;
> +	int error;
> +	VMA_ITERATOR(vmi, mm, addr);
> +	VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, pgoff);
> +	MMAP_STATE(map, mm, &vmi, &vmg, uf, vm_flags, len);
> 
> -unmap_and_free_file_vma:
> -	fput(vma->vm_file);
> -	vma->vm_file = NULL;
> +	vmg.file = file;
> 
> -	vma_iter_set(&vmi, vma->vm_end);
> -	/* Undo any partial mapping done by a device driver. */
> -	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> -free_iter_vma:
> -	vma_iter_free(&vmi);
> -free_vma:
> -	vm_area_free(vma);
> -unacct_error:
> -	if (charged)
> -		vm_unacct_memory(charged);
> +	error = __mmap_prepare(&map);
> +	if (error)
> +		goto abort_munmap;
> +
> +	/* Attempt to merge with adjacent VMAs... */
> +	vmg.flags = map.flags;
> +	vma = vma_merge_new_range(&vmg);
> +	if (!vma) {
> +		/* ...but if we can't, allocate a new VMA. */
> +		error = __mmap_new_vma(&map, &vma);
> +		if (error)
> +			goto unacct_error;
> +	}
> +
> +	__mmap_complete(&map, vma);
> 
> +	return addr;
> +
> +unacct_error:
> +	if (map.charged)
> +		vm_unacct_memory(map.charged);
>  abort_munmap:
> -	vms_abort_munmap_vmas(&vms, &mas_detach);
> -gather_failed:
> +	vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
>  	return error;
>  }
> --
> 2.47.0



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error
  2024-10-22 20:40 ` [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
                     ` (2 preceding siblings ...)
  2024-10-23 14:26   ` Liam R. Howlett
@ 2024-10-23 14:41   ` Liam R. Howlett
  3 siblings, 0 replies; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 14:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241022 16:41]:
> Incorrect invocation of VMA callbacks when the VMA is no longer in a
> consistent state is bug prone and risky to perform.
> 
> With regards to the important vm_ops->close() callback, we have gone to
> great lengths to try to track whether or not we ought to close VMAs.
> 
> Rather than doing so and risking making a mistake somewhere, instead
> unconditionally close and reset vma->vm_ops to an empty dummy operations
> set with a NULL .close operator.
> 
> We introduce a new function to do so - vma_close() - and simplify existing
> vms logic which tracked whether we needed to close or not.
> 
> This simplifies the logic, avoids incorrect double-calling of the .close()
> callback and allows us to update error paths to simply call vma_close()
> unconditionally - making VMA closure idempotent.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  mm/internal.h | 17 +++++++++++++++++
>  mm/mmap.c     |  5 ++---
>  mm/nommu.c    |  3 +--
>  mm/vma.c      | 14 +++++---------
>  mm/vma.h      |  4 +---
>  5 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index af032e76dfd4..3a45cc592fd0 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -135,6 +135,23 @@ static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
>  	return err;
>  }
> 
> +/*
> + * Unconditionally close the VMA if it has a close hook and prevent hooks from
> + * being invoked after close. VMA hooks are mutated.
> + */
> +static inline void vma_close(struct vm_area_struct *vma)
> +{
> +	if (vma->vm_ops && vma->vm_ops->close) {
> +		vma->vm_ops->close(vma);
> +
> +		/*
> +		 * The mapping is in an inconsistent state, and no further hooks
> +		 * may be invoked upon it.
> +		 */
> +		vma->vm_ops = &vma_dummy_vm_ops;
> +	}
> +}
> +
>  #ifdef CONFIG_MMU
> 
>  /* Flags for folio_pte_batch(). */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 10f4ccaf491b..d55c58e99a54 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1576,8 +1576,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	return addr;
> 
>  close_and_free_vma:
> -	if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
> -		vma->vm_ops->close(vma);
> +	vma_close(vma);
> 
>  	if (file || vma->vm_file) {
>  unmap_and_free_vma:
> @@ -1937,7 +1936,7 @@ void exit_mmap(struct mm_struct *mm)
>  	do {
>  		if (vma->vm_flags & VM_ACCOUNT)
>  			nr_accounted += vma_pages(vma);
> -		remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
> +		remove_vma(vma, /* unreachable = */ true);
>  		count++;
>  		cond_resched();
>  		vma = vma_next(&vmi);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index f9ccc02458ec..635d028d647b 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -589,8 +589,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
>   */
>  static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
>  {
> -	if (vma->vm_ops && vma->vm_ops->close)
> -		vma->vm_ops->close(vma);
> +	vma_close(vma);
>  	if (vma->vm_file)
>  		fput(vma->vm_file);
>  	put_nommu_region(vma->vm_region);
> diff --git a/mm/vma.c b/mm/vma.c
> index 3c5a80876725..bb7cfa2dc282 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -323,11 +323,10 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
>  /*
>   * Close a vm structure and free it.
>   */
> -void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
> +void remove_vma(struct vm_area_struct *vma, bool unreachable)
>  {
>  	might_sleep();
> -	if (!closed && vma->vm_ops && vma->vm_ops->close)
> -		vma->vm_ops->close(vma);
> +	vma_close(vma);
>  	if (vma->vm_file)
>  		fput(vma->vm_file);
>  	mpol_put(vma_policy(vma));
> @@ -1115,9 +1114,7 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
>  	vms_clear_ptes(vms, mas_detach, true);
>  	mas_set(mas_detach, 0);
>  	mas_for_each(mas_detach, vma, ULONG_MAX)
> -		if (vma->vm_ops && vma->vm_ops->close)
> -			vma->vm_ops->close(vma);
> -	vms->closed_vm_ops = true;
> +		vma_close(vma);
>  }
> 
>  /*
> @@ -1160,7 +1157,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
>  	/* Remove and clean up vmas */
>  	mas_set(mas_detach, 0);
>  	mas_for_each(mas_detach, vma, ULONG_MAX)
> -		remove_vma(vma, /* = */ false, vms->closed_vm_ops);
> +		remove_vma(vma, /* unreachable = */ false);
> 
>  	vm_unacct_memory(vms->nr_accounted);
>  	validate_mm(mm);
> @@ -1684,8 +1681,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  	return new_vma;
> 
>  out_vma_link:
> -	if (new_vma->vm_ops && new_vma->vm_ops->close)
> -		new_vma->vm_ops->close(new_vma);
> +	vma_close(new_vma);
> 
>  	if (new_vma->vm_file)
>  		fput(new_vma->vm_file);
> diff --git a/mm/vma.h b/mm/vma.h
> index 55457cb68200..75558b5e9c8c 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -42,7 +42,6 @@ struct vma_munmap_struct {
>  	int vma_count;                  /* Number of vmas that will be removed */
>  	bool unlock;                    /* Unlock after the munmap */
>  	bool clear_ptes;                /* If there are outstanding PTE to be cleared */
> -	bool closed_vm_ops;		/* call_mmap() was encountered, so vmas may be closed */
>  	/* 1 byte hole */
>  	unsigned long nr_pages;         /* Number of pages being removed */
>  	unsigned long locked_vm;        /* Number of locked pages */
> @@ -198,7 +197,6 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
>  	vms->unmap_start = FIRST_USER_ADDRESS;
>  	vms->unmap_end = USER_PGTABLES_CEILING;
>  	vms->clear_ptes = false;
> -	vms->closed_vm_ops = false;
>  }
>  #endif
> 
> @@ -269,7 +267,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  		  unsigned long start, size_t len, struct list_head *uf,
>  		  bool unlock);
> 
> -void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
> +void remove_vma(struct vm_area_struct *vma, bool unreachable);
> 
>  void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
>  		struct vm_area_struct *prev, struct vm_area_struct *next);
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs
  2024-10-22 20:40 ` [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs Lorenzo Stoakes
@ 2024-10-23 15:01   ` Vlastimil Babka
  2024-10-23 15:16     ` Lorenzo Stoakes
  0 siblings, 1 reply; 46+ messages in thread
From: Vlastimil Babka @ 2024-10-23 15:01 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
	Linus Torvalds, Peter Xu

On 10/22/24 22:40, Lorenzo Stoakes wrote:
> Previously, we'd always try to merge a file-backed VMA if its flags were
> changed by the driver.
> 
> This, however, is rarely meaningful, as typically the flags would be changed
> to VM_PFNMAP or other VM_SPECIAL flags which are inherently unmergeable.
> 
> In cases where it is meaningful (for instance DAX) it is doubtful that this

Hm, if that's true, I'm imagining many piecemeal mmap()s of DAX that used to
merge but will now create tons of VMAs, which doesn't sound great. It also has
potentially breaking effects on mremap(), which doesn't work across multiple
VMAs.
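
(Tangent, and it says nothing about DAX specifically, but the effect of
merging is easy to see from userspace for the plain anonymous case - the whole
region below ends up as a single line in /proc/self/maps rather than ten
separate entries:)

	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		char line[256];
		FILE *f;
		int i;

		/*
		 * Reserve ten pages, then re-map them one page at a time with
		 * identical attributes - the kernel merges the pieces back
		 * into a single VMA.
		 */
		char *base = mmap(NULL, 10 * psz, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (base == MAP_FAILED)
			return 1;

		for (i = 0; i < 10; i++)
			mmap(base + i * psz, psz, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

		printf("region: %p-%p\n", (void *)base,
		       (void *)(base + 10 * psz));

		f = fopen("/proc/self/maps", "r");
		while (f && fgets(line, sizeof(line), f))
			fputs(line, stdout);
		return 0;
	}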

> optimisation is worth the effort and maintenance risk of having to unwind
> state and perform a merge.

What if we simply created a new vma and only then checked whether the flags
changed and we can merge it with its neighbours (i.e. like the mprotect()
merging case)? Less efficient, but less tricky, and hopefully with the same
result?

> Since we've observed bugs and resource leaks due to complexity in this
> area, it is simply not acceptable to have a 'nice to have' optimisation
> like this complicating an already very complicated code path, so let's
> simply eliminate it.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  mm/vma.c | 39 ++-------------------------------------
>  1 file changed, 2 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/vma.c b/mm/vma.c
> index a271e2b406ab..fe1fe5099e78 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -2260,8 +2260,7 @@ static int __mmap_prepare(struct mmap_state *map)
>  	return 0;
>  }
> 
> -static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> -			       struct vm_area_struct **mergep)
> +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma)
>  {
>  	struct vma_iterator *vmi = map->vmi;
>  	struct vma_merge_struct *vmg = map->vmg;
> @@ -2291,34 +2290,6 @@ static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vm
>  			(vma->vm_flags & VM_MAYWRITE));
> 
>  	vma_iter_config(vmi, vmg->start, vmg->end);
> -	/*
> -	 * If flags changed after mmap_file(), we should try merge
> -	 * vma again as we may succeed this time.
> -	 */
> -	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> -		struct vm_area_struct *merge;
> -
> -		vmg->flags = vma->vm_flags;
> -		/* If this fails, state is reset ready for a reattempt. */
> -		merge = vma_merge_new_range(vmg);
> -
> -		if (merge) {
> -			/*
> -			 * ->mmap() can change vma->vm_file and fput
> -			 * the original file. So fput the vma->vm_file
> -			 * here or we would add an extra fput for file
> -			 * and cause general protection fault
> -			 * ultimately.
> -			 */
> -			fput(vma->vm_file);
> -			vm_area_free(vma);
> -			vma_iter_free(vmi);
> -			*mergep = merge;
> -		} else {
> -			vma_iter_config(vmi, vmg->start, vmg->end);
> -		}
> -	}
> -
>  	map->flags = vma->vm_flags;
>  	return 0;
>  }
> @@ -2341,7 +2312,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  {
>  	struct vma_iterator *vmi = map->vmi;
>  	struct vma_merge_struct *vmg = map->vmg;
> -	struct vm_area_struct *merge = NULL;
>  	int error = 0;
>  	struct vm_area_struct *vma;
> 
> @@ -2365,7 +2335,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  	}
> 
>  	if (vmg->file)
> -		error = __mmap_new_file_vma(map, vma, &merge);
> +		error = __mmap_new_file_vma(map, vma);
>  	else if (map->flags & VM_SHARED)
>  		error = shmem_zero_setup(vma);
>  	else
> @@ -2374,9 +2344,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  	if (error)
>  		goto free_iter_vma;
> 
> -	if (merge)
> -		goto file_expanded;
> -
>  #ifdef CONFIG_SPARC64
>  	/* TODO: Fix SPARC ADI! */
>  	WARN_ON_ONCE(!arch_validate_flags(map->flags));
> @@ -2393,8 +2360,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
>  	 * call covers the non-merge case.
>  	 */
>  	khugepaged_enter_vma(vma, map->flags);
> -
> -file_expanded:
>  	ksm_add_vma(vma);
> 
>  	*vmap = vma;
> --
> 2.47.0



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour
  2024-10-22 20:40 ` [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
  2024-10-23 12:58   ` Vlastimil Babka
@ 2024-10-23 15:02   ` Liam R. Howlett
  1 sibling, 0 replies; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 15:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241022 16:41]:
> The mmap_region() function is somewhat terrifying, with spaghetti-like
> control flow and numerous means by which issues can arise and incomplete
> state, memory leaks and other unpleasantness can occur.
> 
> A large amount of the complexity arises from trying to handle errors late
> in the process of mapping a VMA, which forms the basis of recently observed
> issues with resource leaks and observable inconsistent state.
> 
> Taking advantage of previous patches in this series we move a number of
> checks earlier in the code, simplifying things by moving the core of the
> logic into a static internal function __mmap_region().
> 
> Doing this allows us to perform a number of checks up front before we do
> any real work, and allows us to unwind the writable unmap check
> unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
> validation unconditionally also.
> 
> We move a number of things here:
> 
> 1. We preallocate memory for the iterator before we call the file-backed
>    memory hook, allowing us to exit early and avoid having to perform
>    complicated and error-prone close/free logic. We carefully free
>    iterator state on both success and error paths.
> 
> 2. The enclosing mmap_region() function handles the mapping_map_writable()
>    logic early. Previously the logic had the mapping_map_writable() at the
>    point of mapping a newly allocated file-backed VMA, and a matching
>    mapping_unmap_writable() on success and error paths.
> 
>    We now do this unconditionally if this is a file-backed, shared writable
>    mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however,
>    doing so does not invalidate the seal check we just performed, and we in
>    any case always decrement the counter in the wrapper.
> 
>    We perform a debug assert to ensure a driver does not attempt to do the
>    opposite.
> 
> 3. We also move arch_validate_flags() up into the mmap_region()
>    function. This is only relevant on arm64 and sparc64, and the check is
>    only meaningful for SPARC with ADI enabled. We explicitly add a warning
>    for this arch if a driver invalidates this check, though the code ought
>    eventually to be fixed to eliminate the need for this.
> 
> With all of these measures in place, we no longer need to explicitly close
> the VMA on error paths, as we place all checks which might fail prior to a
> call to any driver mmap hook.
> 
> This eliminates an entire class of errors, makes the code easier to reason
> about and more robust.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> Cc: stable <stable@kernel.org>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>


Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  mm/mmap.c | 120 ++++++++++++++++++++++++++++++------------------------
>  1 file changed, 66 insertions(+), 54 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 66edf0ebba94..7d02b47a1895 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1361,20 +1361,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>  	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
>  }
> 
> -unsigned long mmap_region(struct file *file, unsigned long addr,
> +static unsigned long __mmap_region(struct file *file, unsigned long addr,
>  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>  		struct list_head *uf)
>  {
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = NULL;
>  	pgoff_t pglen = PHYS_PFN(len);
> -	struct vm_area_struct *merge;
>  	unsigned long charged = 0;
>  	struct vma_munmap_struct vms;
>  	struct ma_state mas_detach;
>  	struct maple_tree mt_detach;
>  	unsigned long end = addr + len;
> -	bool writable_file_mapping = false;
>  	int error;
>  	VMA_ITERATOR(vmi, mm, addr);
>  	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> @@ -1448,28 +1446,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	vm_flags_init(vma, vm_flags);
>  	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> 
> +	if (vma_iter_prealloc(&vmi, vma)) {
> +		error = -ENOMEM;
> +		goto free_vma;
> +	}
> +
>  	if (file) {
>  		vma->vm_file = get_file(file);
>  		error = mmap_file(file, vma);
>  		if (error)
> -			goto unmap_and_free_vma;
> -
> -		if (vma_is_shared_maywrite(vma)) {
> -			error = mapping_map_writable(file->f_mapping);
> -			if (error)
> -				goto close_and_free_vma;
> -
> -			writable_file_mapping = true;
> -		}
> +			goto unmap_and_free_file_vma;
> 
> +		/* Drivers cannot alter the address of the VMA. */
> +		WARN_ON_ONCE(addr != vma->vm_start);
>  		/*
> -		 * Expansion is handled above, merging is handled below.
> -		 * Drivers should not alter the address of the VMA.
> +		 * Drivers should not permit writability when previously it was
> +		 * disallowed.
>  		 */
> -		if (WARN_ON((addr != vma->vm_start))) {
> -			error = -EINVAL;
> -			goto close_and_free_vma;
> -		}
> +		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> +				!(vm_flags & VM_MAYWRITE) &&
> +				(vma->vm_flags & VM_MAYWRITE));
> 
>  		vma_iter_config(&vmi, addr, end);
>  		/*
> @@ -1477,6 +1473,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  		 * vma again as we may succeed this time.
>  		 */
>  		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> +			struct vm_area_struct *merge;
> +
>  			vmg.flags = vma->vm_flags;
>  			/* If this fails, state is reset ready for a reattempt. */
>  			merge = vma_merge_new_range(&vmg);
> @@ -1491,10 +1489,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  				 */
>  				fput(vma->vm_file);
>  				vm_area_free(vma);
> +				vma_iter_free(&vmi);
>  				vma = merge;

Nit: Might be worth moving the vma_iter_free() down (if not removed) so
that the vma pointer remains sane.  Just in case more stuff gets stuffed
between the two calls.
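
i.e. something like this, perhaps (just a sketch of the reordering, untested):

	fput(vma->vm_file);
	vm_area_free(vma);
	vma = merge;		/* the old pointer is dead, switch immediately */
	vma_iter_free(&vmi);
	/* Update vm_flags to pick up the change. */
	vm_flags = vma->vm_flags;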

>  				/* Update vm_flags to pick up the change. */
>  				vm_flags = vma->vm_flags;
> -				goto unmap_writable;
> +				goto file_expanded;
>  			}
>  			vma_iter_config(&vmi, addr, end);
>  		}
> @@ -1503,26 +1502,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	} else if (vm_flags & VM_SHARED) {
>  		error = shmem_zero_setup(vma);
>  		if (error)
> -			goto free_vma;
> +			goto free_iter_vma;
>  	} else {
>  		vma_set_anonymous(vma);
>  	}
> 
> -	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
> -		error = -EACCES;
> -		goto close_and_free_vma;
> -	}
> -
> -	/* Allow architectures to sanity-check the vm_flags */
> -	if (!arch_validate_flags(vma->vm_flags)) {
> -		error = -EINVAL;
> -		goto close_and_free_vma;
> -	}
> -
> -	if (vma_iter_prealloc(&vmi, vma)) {
> -		error = -ENOMEM;
> -		goto close_and_free_vma;
> -	}
> +#ifdef CONFIG_SPARC64
> +	/* TODO: Fix SPARC ADI! */
> +	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> +#endif
> 
>  	/* Lock the VMA since it is modified after insertion into VMA tree */
>  	vma_start_write(vma);
> @@ -1536,10 +1524,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	 */
>  	khugepaged_enter_vma(vma, vma->vm_flags);
> 
> -	/* Once vma denies write, undo our temporary denial count */
> -unmap_writable:
> -	if (writable_file_mapping)
> -		mapping_unmap_writable(file->f_mapping);
> +file_expanded:
>  	file = vma->vm_file;
>  	ksm_add_vma(vma);
>  expanded:
> @@ -1572,23 +1557,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> 
>  	vma_set_page_prot(vma);
> 
> -	validate_mm(mm);
>  	return addr;
> 
> -close_and_free_vma:
> -	vma_close(vma);
> -
> -	if (file || vma->vm_file) {
> -unmap_and_free_vma:
> -		fput(vma->vm_file);
> -		vma->vm_file = NULL;
> +unmap_and_free_file_vma:
> +	fput(vma->vm_file);
> +	vma->vm_file = NULL;
> 
> -		vma_iter_set(&vmi, vma->vm_end);
> -		/* Undo any partial mapping done by a device driver. */
> -		unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> -	}
> -	if (writable_file_mapping)
> -		mapping_unmap_writable(file->f_mapping);
> +	vma_iter_set(&vmi, vma->vm_end);
> +	/* Undo any partial mapping done by a device driver. */
> +	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> +free_iter_vma:
> +	vma_iter_free(&vmi);
>  free_vma:
>  	vm_area_free(vma);
>  unacct_error:
> @@ -1598,10 +1577,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  abort_munmap:
>  	vms_abort_munmap_vmas(&vms, &mas_detach);
>  gather_failed:
> -	validate_mm(mm);
>  	return error;
>  }
> 
> +unsigned long mmap_region(struct file *file, unsigned long addr,
> +			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> +			  struct list_head *uf)
> +{
> +	unsigned long ret;
> +	bool writable_file_mapping = false;
> +
> +	/* Allow architectures to sanity-check the vm_flags. */
> +	if (!arch_validate_flags(vm_flags))
> +		return -EINVAL;
> +
> +	/* Check to see if MDWE is applicable. */
> +	if (map_deny_write_exec(vm_flags, vm_flags))
> +		return -EACCES;
> +
> +	/* Map writable and ensure this isn't a sealed memfd. */
> +	if (file && is_shared_maywrite(vm_flags)) {
> +		int error = mapping_map_writable(file->f_mapping);
> +
> +		if (error)
> +			return error;
> +		writable_file_mapping = true;
> +	}
> +
> +	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
> +
> +	/* Clear our write mapping regardless of error. */
> +	if (writable_file_mapping)
> +		mapping_unmap_writable(file->f_mapping);
> +
> +	validate_mm(current->mm);
> +	return ret;
> +}
> +
>  static int __vm_munmap(unsigned long start, size_t len, bool unlock)
>  {
>  	int ret;
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs
  2024-10-23 15:01   ` Vlastimil Babka
@ 2024-10-23 15:16     ` Lorenzo Stoakes
  2024-10-23 18:16       ` Liam R. Howlett
  0 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 15:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R . Howlett, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 05:01:29PM +0200, Vlastimil Babka wrote:
> On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > Previously, we'd always try to merge a file-backed VMA if its flags were
> > changed by the driver.
> >
> > This however is rarely meaningful as typically the flags would be changed
> > to VM_PFNMAP or other VM_SPECIAL flags which are inherently unmergable.
> >
> > In cases where it is meaningful (for instance DAX) it is doubtful that this
>
> Hm if that's true, I'm imagining many piecemeal mmap()s of DAX that used to
> merge but now will create tons of VMA's, which doesn't sound great. Then it
> has also potentially breaking effects on mremap() which doesn't work accross
> multiple VMA's.

I said this repeatedly to you and Liam but you both still seemed to want
this :))

Anyway, yes. I mean you're pretty crazy if you are mapping a bunch of
adjacent DAX ranges that are all otherwise mergeable next to one another,
then on that basis assuming that you can mremap() the whole thing.

>
> > optimisation is worth the effort and maintenance risk of having to unwind
> > state and perform a merge.
>
> What if we simply created a new vma but only then checked if the flags
> changed and we can merge it with neighbours (i.e. like the mprotect()
> merging case). Less efficient, but less tricky and with the same result
> hopefully?

I'd probably rather just drop this idea rather than wade into something
entirely new, but let me look at whether we can leverage
vma_modify_flags().

I have a feeling we can't because we already explicitly reset state in the
merge new VMA case, and we'd be introducing a new way in which state could
get mangled.

But I'll take a look and see; otherwise we can just drop this for now and
potentially come back to it later. The key bits of the non-backport patches
are 5-7 anyway.

>
> > Since we've observed bugs and resource leaks due to complexity in this
> > area, it is simply not acceptable to have a 'nice to have' optimisation
> > like this complicating an already very complicated code path, so let's
> > simply eliminate it.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  mm/vma.c | 39 ++-------------------------------------
> >  1 file changed, 2 insertions(+), 37 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index a271e2b406ab..fe1fe5099e78 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -2260,8 +2260,7 @@ static int __mmap_prepare(struct mmap_state *map)
> >  	return 0;
> >  }
> >
> > -static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> > -			       struct vm_area_struct **mergep)
> > +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma)
> >  {
> >  	struct vma_iterator *vmi = map->vmi;
> >  	struct vma_merge_struct *vmg = map->vmg;
> > @@ -2291,34 +2290,6 @@ static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vm
> >  			(vma->vm_flags & VM_MAYWRITE));
> >
> >  	vma_iter_config(vmi, vmg->start, vmg->end);
> > -	/*
> > -	 * If flags changed after mmap_file(), we should try merge
> > -	 * vma again as we may succeed this time.
> > -	 */
> > -	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> > -		struct vm_area_struct *merge;
> > -
> > -		vmg->flags = vma->vm_flags;
> > -		/* If this fails, state is reset ready for a reattempt. */
> > -		merge = vma_merge_new_range(vmg);
> > -
> > -		if (merge) {
> > -			/*
> > -			 * ->mmap() can change vma->vm_file and fput
> > -			 * the original file. So fput the vma->vm_file
> > -			 * here or we would add an extra fput for file
> > -			 * and cause general protection fault
> > -			 * ultimately.
> > -			 */
> > -			fput(vma->vm_file);
> > -			vm_area_free(vma);
> > -			vma_iter_free(vmi);
> > -			*mergep = merge;
> > -		} else {
> > -			vma_iter_config(vmi, vmg->start, vmg->end);
> > -		}
> > -	}
> > -
> >  	map->flags = vma->vm_flags;
> >  	return 0;
> >  }
> > @@ -2341,7 +2312,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >  {
> >  	struct vma_iterator *vmi = map->vmi;
> >  	struct vma_merge_struct *vmg = map->vmg;
> > -	struct vm_area_struct *merge = NULL;
> >  	int error = 0;
> >  	struct vm_area_struct *vma;
> >
> > @@ -2365,7 +2335,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >  	}
> >
> >  	if (vmg->file)
> > -		error = __mmap_new_file_vma(map, vma, &merge);
> > +		error = __mmap_new_file_vma(map, vma);
> >  	else if (map->flags & VM_SHARED)
> >  		error = shmem_zero_setup(vma);
> >  	else
> > @@ -2374,9 +2344,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >  	if (error)
> >  		goto free_iter_vma;
> >
> > -	if (merge)
> > -		goto file_expanded;
> > -
> >  #ifdef CONFIG_SPARC64
> >  	/* TODO: Fix SPARC ADI! */
> >  	WARN_ON_ONCE(!arch_validate_flags(map->flags));
> > @@ -2393,8 +2360,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> >  	 * call covers the non-merge case.
> >  	 */
> >  	khugepaged_enter_vma(vma, map->flags);
> > -
> > -file_expanded:
> >  	ksm_add_vma(vma);
> >
> >  	*vmap = vma;
> > --
> > 2.47.0
>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region()
  2024-10-23 14:38   ` Vlastimil Babka
@ 2024-10-23 15:21     ` Liam R. Howlett
  2024-10-23 17:39       ` Lorenzo Stoakes
  2024-10-23 17:30     ` Lorenzo Stoakes
  1 sibling, 1 reply; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 15:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Lorenzo Stoakes, Andrew Morton, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Vlastimil Babka <vbabka@suse.cz> [241023 10:39]:
> On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > We have seen bugs and resource leaks arise from the complexity of the
> > __mmap_region() function. This, and the generally deeply fragile error
> > handling logic and complexity which makes understanding the function
> > difficult make it highly desirable to refactor it into something readable.
> > 
> > Achieve this by separating the function into smaller logical parts which
> > are easier to understand and follow, and which importantly very
> > significantly simplify the error handling.
> > 
> > Note that we now call vms_abort_munmap_vmas() in more error paths than we
> > used to, however in cases where no abort need occur, vms->nr_pages will be
> > equal to zero and we simply exit this function without doing more than we
> > would have done previously.
> > 
> > Importantly, the invocation of the driver mmap hook via mmap_file() now has
> > very simple and obvious handling (this was previously the most problematic
> > part of the mmap() operation).
> > 
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  mm/vma.c | 380 +++++++++++++++++++++++++++++++++++--------------------
> >  1 file changed, 240 insertions(+), 140 deletions(-)
> > 
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 7617f9d50d62..a271e2b406ab 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -7,6 +7,31 @@
> >  #include "vma_internal.h"
> >  #include "vma.h"
> > 
> > +struct mmap_state {
> > +	struct mm_struct *mm;
> > +	struct vma_iterator *vmi;
> > +	struct vma_merge_struct *vmg;
> > +	struct list_head *uf;
> > +
> > +	struct vma_munmap_struct vms;
> > +	struct ma_state mas_detach;
> > +	struct maple_tree mt_detach;
> > +
> > +	unsigned long flags;
> > +	unsigned long pglen;
> > +	unsigned long charged;
> > +};
> > +
> > +#define MMAP_STATE(name, mm_, vmi_, vmg_, uf_, flags_, len_)	\
> > +	struct mmap_state name = {				\
> > +		.mm = mm_,					\
> > +		.vmi = vmi_,					\
> > +		.vmg = vmg_,					\
> > +		.uf = uf_,					\
> > +		.flags = flags_,				\
> > +		.pglen = PHYS_PFN(len_),			\
> > +	}
> > +
> >  static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
> >  {
> >  	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
> > @@ -2169,189 +2194,247 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> >  	vms_complete_munmap_vmas(vms, mas_detach);
> >  }
> > 
> > -unsigned long __mmap_region(struct file *file, unsigned long addr,
> > -		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > -		struct list_head *uf)
> > +/*
> > + * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
> > + *                    unmapped once the map operation is completed, check limits,
> > + *                    account mapping and clean up any pre-existing VMAs.
> > + *
> > + * @map: Mapping state.
> > + *
> > + * Returns: 0 on success, error code otherwise.
> > + */
> > +static int __mmap_prepare(struct mmap_state *map)
> >  {
> > -	struct mm_struct *mm = current->mm;
> > -	struct vm_area_struct *vma = NULL;
> > -	pgoff_t pglen = PHYS_PFN(len);
> > -	unsigned long charged = 0;
> > -	struct vma_munmap_struct vms;
> > -	struct ma_state mas_detach;
> > -	struct maple_tree mt_detach;
> > -	unsigned long end = addr + len;
> >  	int error;
> > -	VMA_ITERATOR(vmi, mm, addr);
> > -	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > -
> > -	vmg.file = file;
> > -	/* Find the first overlapping VMA */
> > -	vma = vma_find(&vmi, end);
> > -	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > -	if (vma) {
> > -		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > -		mt_on_stack(mt_detach);
> > -		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	struct vma_munmap_struct *vms = &map->vms;
> > +
> > +	/* Find the first overlapping VMA and initialise unmap state. */
> > +	vms->vma = vma_find(vmi, vmg->end);
> > +	init_vma_munmap(vms, vmi, vms->vma, vmg->start, vmg->end, map->uf,
> > +			/* unlock = */ false);
> > +
> > +	/* OK, we have overlapping VMAs - prepare to unmap them. */
> > +	if (vms->vma) {
> > +		mt_init_flags(&map->mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > +		mt_on_stack(map->mt_detach);
> > +		mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
> >  		/* Prepare to unmap any existing mapping in the area */
> > -		error = vms_gather_munmap_vmas(&vms, &mas_detach);
> > +		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
> >  		if (error)
> > -			goto gather_failed;
> > +			return error;
> 
> So this assumes vms_abort_munmap_vmas() will rely on the "vms->nr_pages will
> be equal to zero" mentioned in the commit log. But AFAICS
> vms_gather_munmap_vmas() can fail in the Nth iteration of its
> for_each_vma_range() after some iterations have already increased nr_pages;
> it will then do a reattach_vmas() and return the error, and we just return
> the error here.
> I think either here, or maybe in vms_gather_munmap_vmas() itself, a reset of
> vms->nr_pages to zero on error should happen for the vms_abort_munmap_vmas()
> to be a no-op?

Probably in reattach_vmas()?
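
Or wherever it ends up, something along these lines (rough sketch only,
untested, exact placement and fields to be decided):

	/*
	 * The gather failed and the VMAs have been reattached, so make the
	 * later vms_abort_munmap_vmas() a no-op.
	 */
	vms->nr_pages = 0;
	vms->vma_count = 0;
	vms->nr_accounted = 0;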

> 
> > 
> > -		vmg.next = vms.next;
> > -		vmg.prev = vms.prev;
> > -		vma = NULL;
> > +		vmg->next = vms->next;
> > +		vmg->prev = vms->prev;
> >  	} else {
> > -		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
> > +		vmg->next = vma_iter_next_rewind(vmi, &vmg->prev);
> >  	}
> > 
> >  	/* Check against address space limit. */
> > -	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
> > -		error = -ENOMEM;
> > -		goto abort_munmap;
> > -	}
> > +	if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
> > +		return -ENOMEM;
> > 
> > -	/*
> > -	 * Private writable mapping: check memory availability
> > -	 */
> > -	if (accountable_mapping(file, vm_flags)) {
> > -		charged = pglen;
> > -		charged -= vms.nr_accounted;
> > -		if (charged) {
> > -			error = security_vm_enough_memory_mm(mm, charged);
> > +	/* Private writable mapping: check memory availability. */
> > +	if (accountable_mapping(vmg->file, map->flags)) {
> > +		map->charged = map->pglen;
> > +		map->charged -= vms->nr_accounted;
> > +		if (map->charged) {
> > +			error = security_vm_enough_memory_mm(map->mm, map->charged);
> >  			if (error)
> > -				goto abort_munmap;
> > +				return error;
> >  		}
> > 
> > -		vms.nr_accounted = 0;
> > -		vm_flags |= VM_ACCOUNT;
> > -		vmg.flags = vm_flags;
> > +		vms->nr_accounted = 0;
> > +		map->flags |= VM_ACCOUNT;
> >  	}
> > 
> >  	/*
> > -	 * clear PTEs while the vma is still in the tree so that rmap
> > +	 * Clear PTEs while the vma is still in the tree so that rmap
> >  	 * cannot race with the freeing later in the truncate scenario.
> >  	 * This is also needed for mmap_file(), which is why vm_ops
> >  	 * close function is called.
> >  	 */
> > -	vms_clean_up_area(&vms, &mas_detach);
> > -	vma = vma_merge_new_range(&vmg);
> > -	if (vma)
> > -		goto expanded;
> > +	vms_clean_up_area(vms, &map->mas_detach);
> > +
> > +	return 0;
> > +}
> > +
> > +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> > +			       struct vm_area_struct **mergep)
> > +{
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	int error;
> > +
> > +	vma->vm_file = get_file(vmg->file);
> > +	error = mmap_file(vma->vm_file, vma);
> > +	if (error) {
> > +		fput(vma->vm_file);
> > +		vma->vm_file = NULL;
> > +
> > +		vma_iter_set(vmi, vma->vm_end);
> > +		/* Undo any partial mapping done by a device driver. */
> > +		unmap_region(&vmi->mas, vma, vmg->prev, vmg->next);
> > +
> > +		return error;
> > +	}
> > +
> > +	/* Drivers cannot alter the address of the VMA. */
> > +	WARN_ON_ONCE(vmg->start != vma->vm_start);
> > +	/*
> > +	 * Drivers should not permit writability when previously it was
> > +	 * disallowed.
> > +	 */
> > +	VM_WARN_ON_ONCE(map->flags != vma->vm_flags &&
> > +			!(map->flags & VM_MAYWRITE) &&
> > +			(vma->vm_flags & VM_MAYWRITE));
> > +
> > +	vma_iter_config(vmi, vmg->start, vmg->end);
> > +	/*
> > +	 * If flags changed after mmap_file(), we should try merge
> > +	 * vma again as we may succeed this time.
> > +	 */
> > +	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> > +		struct vm_area_struct *merge;
> > +
> > +		vmg->flags = vma->vm_flags;
> > +		/* If this fails, state is reset ready for a reattempt. */
> > +		merge = vma_merge_new_range(vmg);
> > +
> > +		if (merge) {
> > +			/*
> > +			 * ->mmap() can change vma->vm_file and fput
> > +			 * the original file. So fput the vma->vm_file
> > +			 * here or we would add an extra fput for file
> > +			 * and cause general protection fault
> > +			 * ultimately.
> > +			 */
> > +			fput(vma->vm_file);
> > +			vm_area_free(vma);
> 
> This frees the vma.
> 
> > +			vma_iter_free(vmi);
> > +			*mergep = merge;
> > +		} else {
> > +			vma_iter_config(vmi, vmg->start, vmg->end);
> > +		}
> > +	}
> > +
> > +	map->flags = vma->vm_flags;
> 
> So this is use-after-free.
> 
> Maybe pass only a single struct vm_area_struct **vmap parameter to this
> function, and in case of merge, change both vma and *vmap to it?
> 
> Although I can see it's all moot after 8/8. Still let's not introduce a
> temporary UAF step.
> 
> > +	return 0;
> > +}
> > +
> > +/*
> > + * __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
> > + *                    possible.
> > + *
> > + *                    An exception to this is if the mapping is file-backed, and
> > + *                    the underlying driver changes the VMA flags, permitting a
> > + *                    subsequent merge of the VMA, in which case the returned
> > + *                    VMA is one that was merged on a second attempt.
> > + *
> > + * @map:  Mapping state.
> > + * @vmap: Output pointer for the new VMA.
> > + *
> > + * Returns: Zero on success, or an error.
> > + */
> > +static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > +{
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	struct vm_area_struct *merge = NULL;
> > +	int error = 0;
> > +	struct vm_area_struct *vma;
> > +
> >  	/*
> >  	 * Determine the object being mapped and call the appropriate
> >  	 * specific mapper. the address has already been validated, but
> >  	 * not unmapped, but the maps are removed from the list.
> >  	 */
> > -	vma = vm_area_alloc(mm);
> > -	if (!vma) {
> > -		error = -ENOMEM;
> > -		goto unacct_error;
> > -	}
> > +	vma = vm_area_alloc(map->mm);
> > +	if (!vma)
> > +		return -ENOMEM;
> > 
> > -	vma_iter_config(&vmi, addr, end);
> > -	vma_set_range(vma, addr, end, pgoff);
> > -	vm_flags_init(vma, vm_flags);
> > -	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > +	vma_iter_config(vmi, vmg->start, vmg->end);
> > +	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
> > +	vm_flags_init(vma, map->flags);
> > +	vma->vm_page_prot = vm_get_page_prot(map->flags);
> > 
> > -	if (vma_iter_prealloc(&vmi, vma)) {
> > +	if (vma_iter_prealloc(vmi, vma)) {
> >  		error = -ENOMEM;
> >  		goto free_vma;
> >  	}
> > 
> > -	if (file) {
> > -		vma->vm_file = get_file(file);
> > -		error = mmap_file(file, vma);
> > -		if (error)
> > -			goto unmap_and_free_file_vma;
> > -
> > -		/* Drivers cannot alter the address of the VMA. */
> > -		WARN_ON_ONCE(addr != vma->vm_start);
> > -		/*
> > -		 * Drivers should not permit writability when previously it was
> > -		 * disallowed.
> > -		 */
> > -		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> > -				!(vm_flags & VM_MAYWRITE) &&
> > -				(vma->vm_flags & VM_MAYWRITE));
> > -
> > -		vma_iter_config(&vmi, addr, end);
> > -		/*
> > -		 * If vm_flags changed after mmap_file(), we should try merge
> > -		 * vma again as we may succeed this time.
> > -		 */
> > -		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> > -			struct vm_area_struct *merge;
> > -
> > -			vmg.flags = vma->vm_flags;
> > -			/* If this fails, state is reset ready for a reattempt. */
> > -			merge = vma_merge_new_range(&vmg);
> > -
> > -			if (merge) {
> > -				/*
> > -				 * ->mmap() can change vma->vm_file and fput
> > -				 * the original file. So fput the vma->vm_file
> > -				 * here or we would add an extra fput for file
> > -				 * and cause general protection fault
> > -				 * ultimately.
> > -				 */
> > -				fput(vma->vm_file);
> > -				vm_area_free(vma);
> > -				vma_iter_free(&vmi);
> > -				vma = merge;
> > -				/* Update vm_flags to pick up the change. */
> > -				vm_flags = vma->vm_flags;
> > -				goto file_expanded;
> > -			}
> > -			vma_iter_config(&vmi, addr, end);
> > -		}
> > -
> > -		vm_flags = vma->vm_flags;
> > -	} else if (vm_flags & VM_SHARED) {
> > +	if (vmg->file)
> > +		error = __mmap_new_file_vma(map, vma, &merge);
> > +	else if (map->flags & VM_SHARED)
> >  		error = shmem_zero_setup(vma);
> > -		if (error)
> > -			goto free_iter_vma;
> > -	} else {
> > +	else
> >  		vma_set_anonymous(vma);
> > -	}
> > +
> > +	if (error)
> > +		goto free_iter_vma;
> > +
> > +	if (merge)
> > +		goto file_expanded;
> > 
> >  #ifdef CONFIG_SPARC64
> >  	/* TODO: Fix SPARC ADI! */
> > -	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> > +	WARN_ON_ONCE(!arch_validate_flags(map->flags));
> >  #endif
> > 
> >  	/* Lock the VMA since it is modified after insertion into VMA tree */
> >  	vma_start_write(vma);
> > -	vma_iter_store(&vmi, vma);
> > -	mm->map_count++;
> > +	vma_iter_store(vmi, vma);
> > +	map->mm->map_count++;
> >  	vma_link_file(vma);
> > 
> >  	/*
> >  	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
> >  	 * call covers the non-merge case.
> >  	 */
> > -	khugepaged_enter_vma(vma, vma->vm_flags);
> > +	khugepaged_enter_vma(vma, map->flags);
> > 
> >  file_expanded:
> > -	file = vma->vm_file;
> >  	ksm_add_vma(vma);
> > -expanded:
> > +
> > +	*vmap = vma;
> > +	return 0;
> > +
> > +free_iter_vma:
> > +	vma_iter_free(vmi);
> > +free_vma:
> > +	vm_area_free(vma);
> > +	return error;
> > +}
> > +
> > +/*
> > + * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
> > + *                     statistics, handle locking and finalise the VMA.
> > + *
> > + * @map: Mapping state.
> > + * @vma: Merged or newly allocated VMA for the mmap()'d region.
> > + */
> > +static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> > +{
> > +	struct mm_struct *mm = map->mm;
> > +	unsigned long vm_flags = vma->vm_flags;
> > +
> >  	perf_event_mmap(vma);
> > 
> > -	/* Unmap any existing mapping in the area */
> > -	vms_complete_munmap_vmas(&vms, &mas_detach);
> > +	/* Unmap any existing mapping in the area. */
> > +	vms_complete_munmap_vmas(&map->vms, &map->mas_detach);
> > 
> > -	vm_stat_account(mm, vm_flags, pglen);
> > +	vm_stat_account(mm, vma->vm_flags, map->pglen);
> >  	if (vm_flags & VM_LOCKED) {
> >  		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> >  					is_vm_hugetlb_page(vma) ||
> > -					vma == get_gate_vma(current->mm))
> > +					vma == get_gate_vma(mm))
> >  			vm_flags_clear(vma, VM_LOCKED_MASK);
> >  		else
> > -			mm->locked_vm += pglen;
> > +			mm->locked_vm += map->pglen;
> >  	}
> > 
> > -	if (file)
> > +	if (vma->vm_file)
> >  		uprobe_mmap(vma);
> > 
> >  	/*
> > @@ -2364,26 +2447,43 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
> >  	vm_flags_set(vma, VM_SOFTDIRTY);
> > 
> >  	vma_set_page_prot(vma);
> > +}
> > 
> > -	return addr;
> > +unsigned long __mmap_region(struct file *file, unsigned long addr,
> > +		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > +		struct list_head *uf)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	struct vm_area_struct *vma;
> > +	int error;
> > +	VMA_ITERATOR(vmi, mm, addr);
> > +	VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, pgoff);
> > +	MMAP_STATE(map, mm, &vmi, &vmg, uf, vm_flags, len);
> > 
> > -unmap_and_free_file_vma:
> > -	fput(vma->vm_file);
> > -	vma->vm_file = NULL;
> > +	vmg.file = file;
> > 
> > -	vma_iter_set(&vmi, vma->vm_end);
> > -	/* Undo any partial mapping done by a device driver. */
> > -	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > -free_iter_vma:
> > -	vma_iter_free(&vmi);
> > -free_vma:
> > -	vm_area_free(vma);
> > -unacct_error:
> > -	if (charged)
> > -		vm_unacct_memory(charged);
> > +	error = __mmap_prepare(&map);
> > +	if (error)
> > +		goto abort_munmap;
> > +
> > +	/* Attempt to merge with adjacent VMAs... */
> > +	vmg.flags = map.flags;
> > +	vma = vma_merge_new_range(&vmg);
> > +	if (!vma) {
> > +		/* ...but if we can't, allocate a new VMA. */
> > +		error = __mmap_new_vma(&map, &vma);
> > +		if (error)
> > +			goto unacct_error;
> > +	}
> > +
> > +	__mmap_complete(&map, vma);
> > 
> > +	return addr;
> > +
> > +unacct_error:
> > +	if (map.charged)
> > +		vm_unacct_memory(map.charged);
> >  abort_munmap:
> > -	vms_abort_munmap_vmas(&vms, &mas_detach);
> > -gather_failed:
> > +	vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
> >  	return error;
> >  }
> > --
> > 2.47.0
> 


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec()
  2024-10-23 14:30   ` Liam R. Howlett
@ 2024-10-23 16:25     ` Linus Torvalds
  2024-10-23 16:37     ` Lorenzo Stoakes
  1 sibling, 0 replies; 46+ messages in thread
From: Linus Torvalds @ 2024-10-23 16:25 UTC (permalink / raw)
  To: Liam R. Howlett, Lorenzo Stoakes, Andrew Morton, Vlastimil Babka,
	Jann Horn, linux-kernel, linux-mm, Linus Torvalds, Peter Xu

On Wed, 23 Oct 2024 at 07:30, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> Nit: somehow "new" is special in my editor.  Also, I'm too lazy to fix
> my editor so I'll live with it.

I think your editor might be in some "edit C++" mode, where "new" is a keyword.

So it might be as simple as telling your editor that it's just C.

Or your editor just doesn't know or care about the difference.
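
(In vim, for instance, something like ":setlocal filetype=c" or the
equivalent modeline should stop "new" being highlighted as a C++ keyword -
assuming that's the editor in question.)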

               Linus


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec()
  2024-10-22 21:15   ` Jann Horn
@ 2024-10-23 16:35     ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:35 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Tue, Oct 22, 2024 at 11:15:18PM +0200, Jann Horn wrote:
> On Tue, Oct 22, 2024 at 10:41 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Refactor the map_deny_write_exec() to not unnecessarily require a VMA
> > parameter but rather to accept VMA flags parameters, which allows us to use
> > this function early in mmap_region() in a subsequent commit.
> >
> > While we're here, we refactor the function to be more readable and add some
> > additional documentation.
> >
> > Reported-by: Jann Horn <jannh@google.com>
> > Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> > Cc: stable <stable@kernel.org>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Jann Horn <jannh@google.com>

Thanks!

>
> [..]
> > -static inline bool map_deny_write_exec(struct vm_area_struct *vma,  unsigned long vm_flags)
> > +static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
> >  {
> > +       /* If MDWE is disabled, we have nothing to deny. */
> >         if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
> >                 return false;
> >
> > -       if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
> > +       /* If the new VMA is not executable, we have nothing to deny. */
> > +       if (!(new & VM_EXEC))
> > +               return false;
> > +
> > +       /* Under MDWE we absolutely do not accept writably executable... */
> > +       if (new & VM_WRITE)
> >                 return true;
> >
> > -       if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
> > +       /* ...nor newly executable VMAs. */
>
> nit: maybe clarify this as "nor existing VMAs newly becoming
> executable" or something like that

Ack
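
So probably something like this (exact wording to be settled in the respin):

	/* ...nor existing VMAs newly becoming executable. */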

>
>
> > +       if (!(old & VM_EXEC))
> >                 return true;
> >
> >         return false;


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec()
  2024-10-23 14:30   ` Liam R. Howlett
  2024-10-23 16:25     ` Linus Torvalds
@ 2024-10-23 16:37     ` Lorenzo Stoakes
  1 sibling, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:37 UTC (permalink / raw)
  To: Liam R. Howlett, Andrew Morton, Vlastimil Babka, Jann Horn,
	linux-kernel, linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 10:30:46AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241022 16:41]:
> > Refactor the map_deny_write_exec() to not unnecessarily require a VMA
> > parameter but rather to accept VMA flags parameters, which allows us to use
> > this function early in mmap_region() in a subsequent commit.
> >
> > While we're here, we refactor the function to be more readable and add some
> > additional documentation.
> >
> > Reported-by: Jann Horn <jannh@google.com>
> > Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> > Cc: stable <stable@kernel.org>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

Thanks!

>
> > ---
> >  include/linux/mman.h | 21 ++++++++++++++++++---
> >  mm/mmap.c            |  2 +-
> >  mm/mprotect.c        |  2 +-
> >  3 files changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index bcb201ab7a41..d71f83b821ef 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -188,16 +188,31 @@ static inline bool arch_memory_deny_write_exec_supported(void)
> >   *
> >   *	d)	mmap(PROT_READ | PROT_EXEC)
> >   *		mmap(PROT_READ | PROT_EXEC | PROT_BTI)
>
> nit: the arguments are not in the comments (which is a pre-existing
> issue).

I add them! ----
               |
>              |
> > + *        |
> > + * This is only applicable if the user has set the Memory-Deny-Write-Execute
> > + * (MDWE) protection mask for the current process.
> > + *   v----|---------------------------------------------------------v
> > + * @old specifies the VMA flags the VMA originally possessed, and @new the ones
> > + * we propose to set.
> > + *
> > + * Return: false if proposed change is OK, true if not ok and should be denied.
> >   */
> > -static inline bool map_deny_write_exec(struct vm_area_struct *vma,  unsigned long vm_flags)
> > +static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
>
> Nit: somehow "new" is special in my editor.  Also, I'm too lazy to fix
> my editor so I'll live with it.

Time for a "new" editor! ;)

>
> >  {
> > +	/* If MDWE is disabled, we have nothing to deny. */
> >  	if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
> >  		return false;
> >
> > -	if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
> > +	/* If the new VMA is not executable, we have nothing to deny. */
> > +	if (!(new & VM_EXEC))
> > +		return false;
> > +
> > +	/* Under MDWE we absolutely do not accept writably executable... */
> > +	if (new & VM_WRITE)
> >  		return true;
> >
> > -	if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
> > +	/* ...nor newly executable VMAs. */
> > +	if (!(old & VM_EXEC))
> >  		return true;
> >
> >  	return false;
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d55c58e99a54..66edf0ebba94 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1508,7 +1508,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  		vma_set_anonymous(vma);
> >  	}
> >
> > -	if (map_deny_write_exec(vma, vma->vm_flags)) {
> > +	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
> >  		error = -EACCES;
> >  		goto close_and_free_vma;
> >  	}
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 0c5d6d06107d..6f450af3252e 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -810,7 +810,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
> >  			break;
> >  		}
> >
> > -		if (map_deny_write_exec(vma, newflags)) {
> > +		if (map_deny_write_exec(vma->vm_flags, newflags)) {
> >  			error = -EACCES;
> >  			break;
> >  		}
> > --
> > 2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  2024-10-22 21:14   ` Jann Horn
@ 2024-10-23 16:56     ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Tue, Oct 22, 2024 at 11:14:58PM +0200, Jann Horn wrote:
> On Tue, Oct 22, 2024 at 10:41 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > After an attempted mmap() fails, we are no longer in a situation where we
> > can safely interact with VMA hooks. This is currently not enforced, meaning
> > that we need complicated handling to ensure we do not incorrectly call
> > these hooks.
> >
> > We can avoid the whole issue by treating the VMA as suspect the moment that
> > the file->f_ops->mmap() function reports an error by replacing whatever VMA
> > operations were installed with a dummy empty set of VMA operations.
> >
> > We do so through a new helper function internal to mm - mmap_file() - which
> > is both more logically named than the existing call_mmap() function and
> > correctly isolates handling of the vm_op reassignment to mm.
> >
> > All the existing invocations of call_mmap() outside of mm are ultimately
> > nested within the call_mmap() from mm, which we now replace.
> >
> > It is therefore safe to leave call_mmap() in place as a convenience
> > function (and to avoid churn). The invokers are:
> >
> >      ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
> >     coda_file_operations -> mmap -> coda_file_mmap()
> >      shm_file_operations -> shm_mmap()
> > shm_file_operations_huge -> shm_mmap()
> >             dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
> >                             -> i915_gem_dmabuf_mmap()
> >
> > None of these callers interact with vm_ops or mappings in a problematic way
> > on error, quickly exiting out.
> >
> > Reported-by: Jann Horn <jannh@google.com>
> > Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
>
> (I guess the "Fixes" tag here is a little unconventional in that it
> doesn't actually point at the commit introducing the issue that this
> commit describes, but it does mark to where the fix should be
> backported, so I guess it makes sense and I don't have any better
> suggestion.)

Yeah it's unfortunate but I think the only thing we can do here.

>
> > Cc: stable <stable@kernel.org>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Jann Horn <jannh@google.com>

Thanks!


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error
  2024-10-23  9:24   ` Vlastimil Babka
@ 2024-10-23 16:58     ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 16:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R . Howlett, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 11:24:40AM +0200, Vlastimil Babka wrote:
> On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > Incorrect invocation of VMA callbacks when the VMA is no longer in a
> > consistent state is bug prone and risky to perform.
> >
> > With regards to the important vm_ops->close() callback, we have gone to
> > great lengths to try to track whether or not we ought to close VMAs.
> >
> > Rather than doing so and risking making a mistake somewhere, instead
> > unconditionally close and reset vma->vm_ops to an empty dummy operations
> > set with a NULL .close operator.
> >
> > We introduce a new function to do so - vma_close() - and simplify existing
> > vms logic which tracked whether we needed to close or not.
> >
> > This simplifies the logic, avoids incorrect double-calling of the .close()
> > callback and allows us to update error paths to simply call vma_close()
> > unconditionally - making VMA closure idempotent.
> >
> > Reported-by: Jann Horn <jannh@google.com>
> > Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> > Cc: stable <stable@kernel.org>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Nice simplification. Nit below.

Thanks!

>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> > +/*
> > + * Unconditionally close the VMA if it has a close hook and prevent hooks from
> > + * being invoked after close. VMA hooks are mutated.
> > + */
> > +static inline void vma_close(struct vm_area_struct *vma)
> > +{
> > +	if (vma->vm_ops && vma->vm_ops->close) {
> > +		vma->vm_ops->close(vma);
> > +
> > +		/*
> > +		 * The mapping is in an inconsistent state, and no further hooks
> > +		 * may be invoked upon it.
> > +		 */
> > +		vma->vm_ops = &vma_dummy_vm_ops;
> > +	}
>
> Nit: if we want to "prevent hooks" as in "any hooks" then we should be
> replacing existing vm_ops even if it has no close hook? If it's enough to
> prevent further close() hooks (as commit log suggests) then the
> implementation is fine but the comment might be misleading.

We prevent hooks _after close_; if it has no close hook, then no. But I'll
update the comment to be crystal clear.
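
Probably something along these lines (draft wording only):

	/*
	 * If the VMA has a close hook, invoke it and then replace vm_ops so
	 * that no further hooks - in particular .close() - can be invoked on
	 * the now inconsistent mapping. If there is no close hook, there is
	 * nothing to prevent.
	 */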

>
> > +}
> > +
> >  #ifdef CONFIG_MMU
> >
> >  /* Flags for folio_pte_batch(). */
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 10f4ccaf491b..d55c58e99a54 100644


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error
  2024-10-22 21:15   ` Jann Horn
@ 2024-10-23 17:00     ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 17:00 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Tue, Oct 22, 2024 at 11:15:10PM +0200, Jann Horn wrote:
> On Tue, Oct 22, 2024 at 10:41 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> > Incorrect invocation of VMA callbacks when the VMA is no longer in a
> > consistent state is bug prone and risky to perform.
> >
> > With regards to the important vm_ops->close() callback, we have gone to
> > great lengths to try to track whether or not we ought to close VMAs.
> >
> > Rather than doing so and risking making a mistake somewhere, instead
> > unconditionally close and reset vma->vm_ops to an empty dummy operations
> > set with a NULL .close operator.
> >
> > We introduce a new function to do so - vma_close() - and simplify existing
> > vms logic which tracked whether we needed to close or not.
> >
> > This simplifies the logic, avoids incorrect double-calling of the .close()
> > callback and allows us to update error paths to simply call vma_close()
> > unconditionally - making VMA closure idempotent.
> >
> > Reported-by: Jann Horn <jannh@google.com>
> > Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> > Cc: stable <stable@kernel.org>
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Jann Horn <jannh@google.com>

Thanks!

>
> [...]
> > diff --git a/mm/vma.h b/mm/vma.h
> > index 55457cb68200..75558b5e9c8c 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -42,7 +42,6 @@ struct vma_munmap_struct {
> >         int vma_count;                  /* Number of vmas that will be removed */
> >         bool unlock;                    /* Unlock after the munmap */
> >         bool clear_ptes;                /* If there are outstanding PTE to be cleared */
> > -       bool closed_vm_ops;             /* call_mmap() was encountered, so vmas may be closed */
> >         /* 1 byte hole */
>
> nit: outdated comment, this hole is 2 bytes now

Ack, will update.
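
i.e. presumably just:

	bool clear_ptes;	/* If there are outstanding PTE to be cleared */
	/* 2 byte hole */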

>
>
>
> >         unsigned long nr_pages;         /* Number of pages being removed */
> >         unsigned long locked_vm;        /* Number of locked pages */


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour
  2024-10-23 14:20     ` Liam R. Howlett
@ 2024-10-23 17:11       ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 17:11 UTC (permalink / raw)
  To: Liam R. Howlett, Vlastimil Babka, Andrew Morton, Jann Horn,
	linux-kernel, linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 10:20:50AM -0400, Liam R. Howlett wrote:
> * Vlastimil Babka <vbabka@suse.cz> [241023 08:59]:
> > On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > > The mmap_region() function is somewhat terrifying, with spaghetti-like
> > > control flow and numerous means by which issues can arise and incomplete
> > > state, memory leaks and other unpleasantness can occur.
> > >
> > > A large amount of the complexity arises from trying to handle errors late
> > > in the process of mapping a VMA, which forms the basis of recently observed
> > > issues with resource leaks and observable inconsistent state.
> > >
> > > Taking advantage of previous patches in this series we move a number of
> > > checks earlier in the code, simplifying things by moving the core of the
> > > logic into a static internal function __mmap_region().
> > >
> > > Doing this allows us to perform a number of checks up front before we do
> > > any real work, and allows us to unwind the writable unmap check
> > > unconditionally as required and to perform a CONFIG_DEBUG_VM_MAPLE_TREE
> > > validation unconditionally also.
> > >
> > > We move a number of things here:
> > >
> > > 1. We preallocate memory for the iterator before we call the file-backed
> > >    memory hook, allowing us to exit early and avoid having to perform
> > >    complicated and error-prone close/free logic. We carefully free
> > >    iterator state on both success and error paths.
> > >
> > > 2. The enclosing mmap_region() function handles the mapping_map_writable()
> > >    logic early. Previously the logic had the mapping_map_writable() at the
> > >    point of mapping a newly allocated file-backed VMA, and a matching
> > >    mapping_unmap_writable() on success and error paths.
> > >
> > >    We now do this unconditionally if this is a file-backed, shared writable
> > >    mapping. If a driver changes the flags to eliminate VM_MAYWRITE, however,
> > >    doing so does not invalidate the seal check we just performed, and we in
> > >    any case always decrement the counter in the wrapper.
> > >
> > >    We perform a debug assert to ensure a driver does not attempt to do the
> > >    opposite.
> > >
> > > 3. We also move arch_validate_flags() up into the mmap_region()
> > >    function. This is only relevant on arm64 and sparc64, and the check is
> > >    only meaningful for SPARC with ADI enabled. We explicitly add a warning
> > >    for this arch if a driver invalidates this check, though the code ought
> > >    eventually to be fixed to eliminate the need for this.
> > >
> > > With all of these measures in place, we no longer need to explicitly close
> > > the VMA on error paths, as we place all checks which might fail prior to a
> > > call to any driver mmap hook.
> > >
> > > This eliminates an entire class of errors, makes the code easier to reason
> > > about and more robust.
> > >
> > > Reported-by: Jann Horn <jannh@google.com>
> > > Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
> > > Cc: stable <stable@kernel.org>
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> >
> > some nits below
> >
> > > ---
> > >  mm/mmap.c | 120 ++++++++++++++++++++++++++++++------------------------
> > >  1 file changed, 66 insertions(+), 54 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 66edf0ebba94..7d02b47a1895 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1361,20 +1361,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> > >  	return do_vmi_munmap(&vmi, mm, start, len, uf, false);
> > >  }
> > >
> > > -unsigned long mmap_region(struct file *file, unsigned long addr,
> > > +static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > >  		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > >  		struct list_head *uf)
> > >  {
> > >  	struct mm_struct *mm = current->mm;
> > >  	struct vm_area_struct *vma = NULL;
> > >  	pgoff_t pglen = PHYS_PFN(len);
> > > -	struct vm_area_struct *merge;
> > >  	unsigned long charged = 0;
> > >  	struct vma_munmap_struct vms;
> > >  	struct ma_state mas_detach;
> > >  	struct maple_tree mt_detach;
> > >  	unsigned long end = addr + len;
> > > -	bool writable_file_mapping = false;
> > >  	int error;
> > >  	VMA_ITERATOR(vmi, mm, addr);
> > >  	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > > @@ -1448,28 +1446,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	vm_flags_init(vma, vm_flags);
> > >  	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > >
> > > +	if (vma_iter_prealloc(&vmi, vma)) {
> > > +		error = -ENOMEM;
> > > +		goto free_vma;
> > > +	}
> > > +
> > >  	if (file) {
> > >  		vma->vm_file = get_file(file);
> > >  		error = mmap_file(file, vma);
> > >  		if (error)
> > > -			goto unmap_and_free_vma;
> > > -
> > > -		if (vma_is_shared_maywrite(vma)) {
> > > -			error = mapping_map_writable(file->f_mapping);
> > > -			if (error)
> > > -				goto close_and_free_vma;
> > > -
> > > -			writable_file_mapping = true;
> > > -		}
> > > +			goto unmap_and_free_file_vma;
> > >
> > > +		/* Drivers cannot alter the address of the VMA. */
> > > +		WARN_ON_ONCE(addr != vma->vm_start);
> > >  		/*
> > > -		 * Expansion is handled above, merging is handled below.
> > > -		 * Drivers should not alter the address of the VMA.
> > > +		 * Drivers should not permit writability when previously it was
> > > +		 * disallowed.
> > >  		 */
> > > -		if (WARN_ON((addr != vma->vm_start))) {
> > > -			error = -EINVAL;
> > > -			goto close_and_free_vma;
> > > -		}
> > > +		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> > > +				!(vm_flags & VM_MAYWRITE) &&
> > > +				(vma->vm_flags & VM_MAYWRITE));
> > >
> > >  		vma_iter_config(&vmi, addr, end);
> >
> > I wonder if this one could be removed; earlier above we did the same config
> > and neither parameter changed? But it was true before this patch as well,
> > and maybe it's further refactored away later in the series, just noting.
>
> Yes, this was here in case the vma changed address, so it's probably not
> necessary.

Hmm, but this was what we already did so I'd rather leave it in for now and
we can perhaps address it later?

>
> >
> > >  		/*
> > > @@ -1477,6 +1473,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  		 * vma again as we may succeed this time.
> > >  		 */
> > >  		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> > > +			struct vm_area_struct *merge;
> > > +
> > >  			vmg.flags = vma->vm_flags;
> > >  			/* If this fails, state is reset ready for a reattempt. */
> > >  			merge = vma_merge_new_range(&vmg);
> > > @@ -1491,10 +1489,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  				 */
> > >  				fput(vma->vm_file);
> > >  				vm_area_free(vma);
> > > +				vma_iter_free(&vmi);
> >
> > If we merged successfully, I think this is not necessary? But doesn't hurt?
>
> Yes, it will use the allocations (and re-allocate more if necessary)
> then free the unused allocations once this call path reaches
> commit_merge() with the same vmi, which is nice.
>
> And yes, it is safe to do regardless.

I will remove it if it isn't necessary, actually. I did think it would be, as
I thought maybe we'd preallocate _twice_ here otherwise? But nice that it
all gets cleaned up.

>
> To be honest, this whole block is so rare that I want to delete it
> anyways.

Yeah I mean I'm inclined to agree... but that last commit is somewhat
contentious it seems :)

>
> >
> > >  				vma = merge;
> > >  				/* Update vm_flags to pick up the change. */
> > >  				vm_flags = vma->vm_flags;
> > > -				goto unmap_writable;
> > > +				goto file_expanded;
> > >  			}
> > >  			vma_iter_config(&vmi, addr, end);
> > >  		}
> > > @@ -1503,26 +1502,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	} else if (vm_flags & VM_SHARED) {
> > >  		error = shmem_zero_setup(vma);
> > >  		if (error)
> > > -			goto free_vma;
> > > +			goto free_iter_vma;
> > >  	} else {
> > >  		vma_set_anonymous(vma);
> > >  	}
> > >
> > > -	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
> > > -		error = -EACCES;
> > > -		goto close_and_free_vma;
> > > -	}
> > > -
> > > -	/* Allow architectures to sanity-check the vm_flags */
> > > -	if (!arch_validate_flags(vma->vm_flags)) {
> > > -		error = -EINVAL;
> > > -		goto close_and_free_vma;
> > > -	}
> > > -
> > > -	if (vma_iter_prealloc(&vmi, vma)) {
> > > -		error = -ENOMEM;
> > > -		goto close_and_free_vma;
> > > -	}
> > > +#ifdef CONFIG_SPARC64
> > > +	/* TODO: Fix SPARC ADI! */
> > > +	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> > > +#endif
> > >
> > >  	/* Lock the VMA since it is modified after insertion into VMA tree */
> > >  	vma_start_write(vma);
> > > @@ -1536,10 +1524,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	 */
> > >  	khugepaged_enter_vma(vma, vma->vm_flags);
> > >
> > > -	/* Once vma denies write, undo our temporary denial count */
> > > -unmap_writable:
> > > -	if (writable_file_mapping)
> > > -		mapping_unmap_writable(file->f_mapping);
> > > +file_expanded:
> > >  	file = vma->vm_file;
> > >  	ksm_add_vma(vma);
> > >  expanded:
> > > @@ -1572,23 +1557,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >
> > >  	vma_set_page_prot(vma);
> > >
> > > -	validate_mm(mm);
> > >  	return addr;
> > >
> > > -close_and_free_vma:
> > > -	vma_close(vma);
> > > -
> > > -	if (file || vma->vm_file) {
> > > -unmap_and_free_vma:
> > > -		fput(vma->vm_file);
> > > -		vma->vm_file = NULL;
> > > +unmap_and_free_file_vma:
> > > +	fput(vma->vm_file);
> > > +	vma->vm_file = NULL;
> > >
> > > -		vma_iter_set(&vmi, vma->vm_end);
> > > -		/* Undo any partial mapping done by a device driver. */
> > > -		unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > > -	}
> > > -	if (writable_file_mapping)
> > > -		mapping_unmap_writable(file->f_mapping);
> > > +	vma_iter_set(&vmi, vma->vm_end);
> > > +	/* Undo any partial mapping done by a device driver. */
> > > +	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > > +free_iter_vma:
> > > +	vma_iter_free(&vmi);
> > >  free_vma:
> > >  	vm_area_free(vma);
> > >  unacct_error:
> > > @@ -1598,10 +1577,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  abort_munmap:
> > >  	vms_abort_munmap_vmas(&vms, &mas_detach);
> > >  gather_failed:
> > > -	validate_mm(mm);
> > >  	return error;
> > >  }
> > >
> > > +unsigned long mmap_region(struct file *file, unsigned long addr,
> > > +			  unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > > +			  struct list_head *uf)
> > > +{
> > > +	unsigned long ret;
> > > +	bool writable_file_mapping = false;
> > > +
> > > +	/* Allow architectures to sanity-check the vm_flags. */
> > > +	if (!arch_validate_flags(vm_flags))
> > > +		return -EINVAL;
> > > +
> > > +	/* Check to see if MDWE is applicable. */
> > > +	if (map_deny_write_exec(vm_flags, vm_flags))
> > > +		return -EACCES;
> >
> > The two checks above used to be in the opposite order. Can we keep that just
> > to be sure we don't change user observable behavior unnecessarily?

Ack will do
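I.e. keep the MDWE check first so -EACCES still takes precedence when both
checks would fail. Roughly (a sketch of just the reordered checks, final
form in the respin):

	/* Check to see if MDWE is applicable. */
	if (map_deny_write_exec(vm_flags, vm_flags))
		return -EACCES;

	/* Allow architectures to sanity-check the vm_flags. */
	if (!arch_validate_flags(vm_flags))
		return -EINVAL;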

> >
> > > +	/* Map writable and ensure this isn't a sealed memfd. */
> > > +	if (file && is_shared_maywrite(vm_flags)) {
> > > +		int error = mapping_map_writable(file->f_mapping);
> > > +
> > > +		if (error)
> > > +			return error;
> > > +		writable_file_mapping = true;
> > > +	}
> > > +
> > > +	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
> > > +
> > > +	/* Clear our write mapping regardless of error. */
> > > +	if (writable_file_mapping)
> > > +		mapping_unmap_writable(file->f_mapping);
> > > +
> > > +	validate_mm(current->mm);
> > > +	return ret;
> > > +}
> > > +
> > >  static int __vm_munmap(unsigned long start, size_t len, bool unlock)
> > >  {
> > >  	int ret;
> > > --
> > > 2.47.0
> >


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region()
  2024-10-22 20:40 ` [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
  2024-10-23 14:38   ` Vlastimil Babka
@ 2024-10-23 17:19   ` Liam R. Howlett
  2024-10-23 17:52     ` Lorenzo Stoakes
  1 sibling, 1 reply; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 17:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Vlastimil Babka, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241022 16:41]:
> We have seen bugs and resource leaks arise from the complexity of the
> __mmap_region() function. This, and the generally deeply fragile error
> handling logic and complexity which makes understanding the function
> difficult make it highly desirable to refactor it into something readable.
> 
> Achieve this by separating the function into smaller logical parts which
> are easier to understand and follow, and which importantly very
> significantly simplify the error handling.
> 
> Note that we now call vms_abort_munmap_vmas() in more error paths than we
> used to, however in cases where no abort need occur, vms->nr_pages will be
> equal to zero and we simply exit this function without doing more than we
> would have done previously.
> 
> Importantly, the invocation of the driver mmap hook via mmap_file() now has
> very simple and obvious handling (this was previously the most problematic
> part of the mmap() operation).
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  mm/vma.c | 380 +++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 240 insertions(+), 140 deletions(-)
> 
> diff --git a/mm/vma.c b/mm/vma.c
> index 7617f9d50d62..a271e2b406ab 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -7,6 +7,31 @@
>  #include "vma_internal.h"
>  #include "vma.h"
> 
> +struct mmap_state {
> +	struct mm_struct *mm;
> +	struct vma_iterator *vmi;
> +	struct vma_merge_struct *vmg;
> +	struct list_head *uf;
> +
> +	struct vma_munmap_struct vms;
> +	struct ma_state mas_detach;
> +	struct maple_tree mt_detach;
> +
> +	unsigned long flags;
> +	unsigned long pglen;
> +	unsigned long charged;
> +};
> +
> +#define MMAP_STATE(name, mm_, vmi_, vmg_, uf_, flags_, len_)	\
> +	struct mmap_state name = {				\
> +		.mm = mm_,					\
> +		.vmi = vmi_,					\
> +		.vmg = vmg_,					\
> +		.uf = uf_,					\
> +		.flags = flags_,				\
> +		.pglen = PHYS_PFN(len_),			\
> +	}
> +
>  static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
>  {
>  	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
> @@ -2169,189 +2194,247 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
>  	vms_complete_munmap_vmas(vms, mas_detach);
>  }
> 
> -unsigned long __mmap_region(struct file *file, unsigned long addr,
> -		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> -		struct list_head *uf)
> +/*
> + * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
> + *                    unmapped once the map operation is completed, check limits,
> + *                    account mapping and clean up any pre-existing VMAs.
> + *

nit: formatting seems wrong here?

> + * @map: Mapping state.
> + *
> + * Returns: 0 on success, error code otherwise.
> + */
> +static int __mmap_prepare(struct mmap_state *map)
>  {
> -	struct mm_struct *mm = current->mm;
> -	struct vm_area_struct *vma = NULL;
> -	pgoff_t pglen = PHYS_PFN(len);
> -	unsigned long charged = 0;
> -	struct vma_munmap_struct vms;
> -	struct ma_state mas_detach;
> -	struct maple_tree mt_detach;
> -	unsigned long end = addr + len;
>  	int error;
> -	VMA_ITERATOR(vmi, mm, addr);
> -	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> -
> -	vmg.file = file;
> -	/* Find the first overlapping VMA */
> -	vma = vma_find(&vmi, end);
> -	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> -	if (vma) {
> -		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> -		mt_on_stack(mt_detach);
> -		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> +	struct vma_iterator *vmi = map->vmi;
> +	struct vma_merge_struct *vmg = map->vmg;
> +	struct vma_munmap_struct *vms = &map->vms;
> +
> +	/* Find the first overlapping VMA and initialise unmap state. */
> +	vms->vma = vma_find(vmi, vmg->end);
> +	init_vma_munmap(vms, vmi, vms->vma, vmg->start, vmg->end, map->uf,
> +			/* unlock = */ false);
> +
> +	/* OK, we have overlapping VMAs - prepare to unmap them. */
> +	if (vms->vma) {
> +		mt_init_flags(&map->mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);

Nit: line is too long.

> +		mt_on_stack(map->mt_detach);
> +		mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
>  		/* Prepare to unmap any existing mapping in the area */
> -		error = vms_gather_munmap_vmas(&vms, &mas_detach);
> +		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
>  		if (error)
> -			goto gather_failed;
> +			return error;

As Vlastimil pointed out, there is an issue just returning the error.

> 
> -		vmg.next = vms.next;
> -		vmg.prev = vms.prev;
> -		vma = NULL;
> +		vmg->next = vms->next;
> +		vmg->prev = vms->prev;
>  	} else {
> -		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
> +		vmg->next = vma_iter_next_rewind(vmi, &vmg->prev);
>  	}
> 
>  	/* Check against address space limit. */
> -	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
> -		error = -ENOMEM;
> -		goto abort_munmap;
> -	}
> +	if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
> +		return -ENOMEM;
> 
> -	/*
> -	 * Private writable mapping: check memory availability
> -	 */
> -	if (accountable_mapping(file, vm_flags)) {
> -		charged = pglen;
> -		charged -= vms.nr_accounted;
> -		if (charged) {
> -			error = security_vm_enough_memory_mm(mm, charged);
> +	/* Private writable mapping: check memory availability. */
> +	if (accountable_mapping(vmg->file, map->flags)) {
> +		map->charged = map->pglen;
> +		map->charged -= vms->nr_accounted;
> +		if (map->charged) {
> +			error = security_vm_enough_memory_mm(map->mm, map->charged);
>  			if (error)
> -				goto abort_munmap;
> +				return error;
>  		}
> 
> -		vms.nr_accounted = 0;
> -		vm_flags |= VM_ACCOUNT;
> -		vmg.flags = vm_flags;
> +		vms->nr_accounted = 0;
> +		map->flags |= VM_ACCOUNT;
>  	}
> 
>  	/*
> -	 * clear PTEs while the vma is still in the tree so that rmap
> +	 * Clear PTEs while the vma is still in the tree so that rmap
>  	 * cannot race with the freeing later in the truncate scenario.
>  	 * This is also needed for mmap_file(), which is why vm_ops
>  	 * close function is called.
>  	 */
> -	vms_clean_up_area(&vms, &mas_detach);
> -	vma = vma_merge_new_range(&vmg);
> -	if (vma)
> -		goto expanded;
> +	vms_clean_up_area(vms, &map->mas_detach);
> +
> +	return 0;
> +}
> +
> +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> +			       struct vm_area_struct **mergep)
> +{
> +	struct vma_iterator *vmi = map->vmi;
> +	struct vma_merge_struct *vmg = map->vmg;
> +	int error;
> +
> +	vma->vm_file = get_file(vmg->file);
> +	error = mmap_file(vma->vm_file, vma);
> +	if (error) {
> +		fput(vma->vm_file);
> +		vma->vm_file = NULL;
> +
> +		vma_iter_set(vmi, vma->vm_end);
> +		/* Undo any partial mapping done by a device driver. */
> +		unmap_region(&vmi->mas, vma, vmg->prev, vmg->next);
> +
> +		return error;
> +	}
> +
> +	/* Drivers cannot alter the address of the VMA. */
> +	WARN_ON_ONCE(vmg->start != vma->vm_start);
> +	/*
> +	 * Drivers should not permit writability when previously it was
> +	 * disallowed.
> +	 */
> +	VM_WARN_ON_ONCE(map->flags != vma->vm_flags &&
> +			!(map->flags & VM_MAYWRITE) &&
> +			(vma->vm_flags & VM_MAYWRITE));
> +
> +	vma_iter_config(vmi, vmg->start, vmg->end);
> +	/*
> +	 * If flags changed after mmap_file(), we should try merge
> +	 * vma again as we may succeed this time.
> +	 */
> +	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> +		struct vm_area_struct *merge;
> +
> +		vmg->flags = vma->vm_flags;
> +		/* If this fails, state is reset ready for a reattempt. */
> +		merge = vma_merge_new_range(vmg);
> +
> +		if (merge) {
> +			/*
> +			 * ->mmap() can change vma->vm_file and fput
> +			 * the original file. So fput the vma->vm_file
> +			 * here or we would add an extra fput for file
> +			 * and cause general protection fault
> +			 * ultimately.
> +			 */
> +			fput(vma->vm_file);
> +			vm_area_free(vma);
> +			vma_iter_free(vmi);
> +			*mergep = merge;
> +		} else {
> +			vma_iter_config(vmi, vmg->start, vmg->end);
> +		}
> +	}
> +
> +	map->flags = vma->vm_flags;
> +	return 0;
> +}
> +
> +/*
> + * __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
> + *                    possible.
> + *
> + *                    An exception to this is if the mapping is file-backed, and
> + *                    the underlying driver changes the VMA flags, permitting a
> + *                    subsequent merge of the VMA, in which case the returned
> + *                    VMA is one that was merged on a second attempt.

It seems all the descriptions have indented lines.

> + *
> + * @map:  Mapping state.
> + * @vmap: Output pointer for the new VMA.
> + *
> + * Returns: Zero on success, or an error.
> + */
> +static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> +{
> +	struct vma_iterator *vmi = map->vmi;
> +	struct vma_merge_struct *vmg = map->vmg;
> +	struct vm_area_struct *merge = NULL;
> +	int error = 0;
> +	struct vm_area_struct *vma;
> +
>  	/*
>  	 * Determine the object being mapped and call the appropriate
>  	 * specific mapper. the address has already been validated, but
>  	 * not unmapped, but the maps are removed from the list.
>  	 */
> -	vma = vm_area_alloc(mm);
> -	if (!vma) {
> -		error = -ENOMEM;
> -		goto unacct_error;
> -	}
> +	vma = vm_area_alloc(map->mm);
> +	if (!vma)
> +		return -ENOMEM;
> 
> -	vma_iter_config(&vmi, addr, end);
> -	vma_set_range(vma, addr, end, pgoff);
> -	vm_flags_init(vma, vm_flags);
> -	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> +	vma_iter_config(vmi, vmg->start, vmg->end);

This function is only called from __mmap_region() after an
__mmap_prepare() and vma_merge_new_range().  The state should be fine,
so maybe this could be WARN_ONs?  Although, it's probably safer to just
leave it.

> +	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
> +	vm_flags_init(vma, map->flags);
> +	vma->vm_page_prot = vm_get_page_prot(map->flags);
> 
> -	if (vma_iter_prealloc(&vmi, vma)) {
> +	if (vma_iter_prealloc(vmi, vma)) {
>  		error = -ENOMEM;
>  		goto free_vma;
>  	}
> 
> -	if (file) {
> -		vma->vm_file = get_file(file);
> -		error = mmap_file(file, vma);
> -		if (error)
> -			goto unmap_and_free_file_vma;
> -
> -		/* Drivers cannot alter the address of the VMA. */
> -		WARN_ON_ONCE(addr != vma->vm_start);
> -		/*
> -		 * Drivers should not permit writability when previously it was
> -		 * disallowed.
> -		 */
> -		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> -				!(vm_flags & VM_MAYWRITE) &&
> -				(vma->vm_flags & VM_MAYWRITE));
> -
> -		vma_iter_config(&vmi, addr, end);
> -		/*
> -		 * If vm_flags changed after mmap_file(), we should try merge
> -		 * vma again as we may succeed this time.
> -		 */
> -		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> -			struct vm_area_struct *merge;
> -
> -			vmg.flags = vma->vm_flags;
> -			/* If this fails, state is reset ready for a reattempt. */
> -			merge = vma_merge_new_range(&vmg);
> -
> -			if (merge) {
> -				/*
> -				 * ->mmap() can change vma->vm_file and fput
> -				 * the original file. So fput the vma->vm_file
> -				 * here or we would add an extra fput for file
> -				 * and cause general protection fault
> -				 * ultimately.
> -				 */
> -				fput(vma->vm_file);
> -				vm_area_free(vma);
> -				vma_iter_free(&vmi);
> -				vma = merge;
> -				/* Update vm_flags to pick up the change. */
> -				vm_flags = vma->vm_flags;
> -				goto file_expanded;
> -			}
> -			vma_iter_config(&vmi, addr, end);
> -		}
> -
> -		vm_flags = vma->vm_flags;
> -	} else if (vm_flags & VM_SHARED) {
> +	if (vmg->file)
> +		error = __mmap_new_file_vma(map, vma, &merge);
> +	else if (map->flags & VM_SHARED)
>  		error = shmem_zero_setup(vma);
> -		if (error)
> -			goto free_iter_vma;
> -	} else {
> +	else
>  		vma_set_anonymous(vma);
> -	}
> +
> +	if (error)
> +		goto free_iter_vma;
> +
> +	if (merge)
> +		goto file_expanded;
> 
>  #ifdef CONFIG_SPARC64
>  	/* TODO: Fix SPARC ADI! */
> -	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> +	WARN_ON_ONCE(!arch_validate_flags(map->flags));
>  #endif
> 
>  	/* Lock the VMA since it is modified after insertion into VMA tree */
>  	vma_start_write(vma);
> -	vma_iter_store(&vmi, vma);
> -	mm->map_count++;
> +	vma_iter_store(vmi, vma);
> +	map->mm->map_count++;
>  	vma_link_file(vma);
> 
>  	/*
>  	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
>  	 * call covers the non-merge case.
>  	 */
> -	khugepaged_enter_vma(vma, vma->vm_flags);
> +	khugepaged_enter_vma(vma, map->flags);
> 
>  file_expanded:
> -	file = vma->vm_file;
>  	ksm_add_vma(vma);

__mmap_new_file_vma() may free the vma.  I assume this is what you
mentioned elsewhere about UAF.

> -expanded:
> +

Extra whitespace.

> +	*vmap = vma;
> +	return 0;
> +
> +free_iter_vma:
> +	vma_iter_free(vmi);
> +free_vma:
> +	vm_area_free(vma);
> +	return error;
> +}
> +
> +/*
> + * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
> + *                     statistics, handle locking and finalise the VMA.
> + *
> + * @map: Mapping state.
> + * @vma: Merged or newly allocated VMA for the mmap()'d region.
> + */
> +static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = map->mm;
> +	unsigned long vm_flags = vma->vm_flags;
> +
>  	perf_event_mmap(vma);
> 
> -	/* Unmap any existing mapping in the area */
> -	vms_complete_munmap_vmas(&vms, &mas_detach);
> +	/* Unmap any existing mapping in the area. */
> +	vms_complete_munmap_vmas(&map->vms, &map->mas_detach);
> 
> -	vm_stat_account(mm, vm_flags, pglen);
> +	vm_stat_account(mm, vma->vm_flags, map->pglen);
>  	if (vm_flags & VM_LOCKED) {
>  		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
>  					is_vm_hugetlb_page(vma) ||
> -					vma == get_gate_vma(current->mm))
> +					vma == get_gate_vma(mm))
>  			vm_flags_clear(vma, VM_LOCKED_MASK);
>  		else
> -			mm->locked_vm += pglen;
> +			mm->locked_vm += map->pglen;
>  	}
> 
> -	if (file)
> +	if (vma->vm_file)
>  		uprobe_mmap(vma);
> 
>  	/*
> @@ -2364,26 +2447,43 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
>  	vm_flags_set(vma, VM_SOFTDIRTY);
> 
>  	vma_set_page_prot(vma);
> +}
> 
> -	return addr;
> +unsigned long __mmap_region(struct file *file, unsigned long addr,
> +		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> +		struct list_head *uf)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma;
> +	int error;
> +	VMA_ITERATOR(vmi, mm, addr);
> +	VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, pgoff);
> +	MMAP_STATE(map, mm, &vmi, &vmg, uf, vm_flags, len);
> 
> -unmap_and_free_file_vma:
> -	fput(vma->vm_file);
> -	vma->vm_file = NULL;
> +	vmg.file = file;
> 
> -	vma_iter_set(&vmi, vma->vm_end);
> -	/* Undo any partial mapping done by a device driver. */
> -	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> -free_iter_vma:
> -	vma_iter_free(&vmi);
> -free_vma:
> -	vm_area_free(vma);
> -unacct_error:
> -	if (charged)
> -		vm_unacct_memory(charged);
> +	error = __mmap_prepare(&map);
> +	if (error)
> +		goto abort_munmap;
> +
> +	/* Attempt to merge with adjacent VMAs... */
> +	vmg.flags = map.flags;
> +	vma = vma_merge_new_range(&vmg);
> +	if (!vma) {
> +		/* ...but if we can't, allocate a new VMA. */
> +		error = __mmap_new_vma(&map, &vma);
> +		if (error)
> +			goto unacct_error;
> +	}
> +
> +	__mmap_complete(&map, vma);
> 
> +	return addr;
> +
> +unacct_error:
> +	if (map.charged)
> +		vm_unacct_memory(map.charged);

So this is effectively undoing __mmap_prepare()'s accounting. I don't
have a better label for it, but it's not obvious by the label that the
accounting was done in __mmap_prepare().

>  abort_munmap:
> -	vms_abort_munmap_vmas(&vms, &mas_detach);
> -gather_failed:
> +	vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
>  	return error;
>  }
> --
> 2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region()
  2024-10-23 14:38   ` Vlastimil Babka
  2024-10-23 15:21     ` Liam R. Howlett
@ 2024-10-23 17:30     ` Lorenzo Stoakes
  1 sibling, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 17:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Liam R . Howlett, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 04:38:46PM +0200, Vlastimil Babka wrote:
> On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > We have seen bugs and resource leaks arise from the complexity of the
> > __mmap_region() function. This, and the generally deeply fragile error
> > handling logic and complexity which makes understanding the function
> > difficult make it highly desirable to refactor it into something readable.
> >
> > Achieve this by separating the function into smaller logical parts which
> > are easier to understand and follow, and which importantly very
> > significantly simplify the error handling.
> >
> > Note that we now call vms_abort_munmap_vmas() in more error paths than we
> > used to, however in cases where no abort need occur, vms->nr_pages will be
> > equal to zero and we simply exit this function without doing more than we
> > would have done previously.
> >
> > Importantly, the invocation of the driver mmap hook via mmap_file() now has
> > very simple and obvious handling (this was previously the most problematic
> > part of the mmap() operation).
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  mm/vma.c | 380 +++++++++++++++++++++++++++++++++++--------------------
> >  1 file changed, 240 insertions(+), 140 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 7617f9d50d62..a271e2b406ab 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -7,6 +7,31 @@
> >  #include "vma_internal.h"
> >  #include "vma.h"
> >
> > +struct mmap_state {
> > +	struct mm_struct *mm;
> > +	struct vma_iterator *vmi;
> > +	struct vma_merge_struct *vmg;
> > +	struct list_head *uf;
> > +
> > +	struct vma_munmap_struct vms;
> > +	struct ma_state mas_detach;
> > +	struct maple_tree mt_detach;
> > +
> > +	unsigned long flags;
> > +	unsigned long pglen;
> > +	unsigned long charged;
> > +};
> > +
> > +#define MMAP_STATE(name, mm_, vmi_, vmg_, uf_, flags_, len_)	\
> > +	struct mmap_state name = {				\
> > +		.mm = mm_,					\
> > +		.vmi = vmi_,					\
> > +		.vmg = vmg_,					\
> > +		.uf = uf_,					\
> > +		.flags = flags_,				\
> > +		.pglen = PHYS_PFN(len_),			\
> > +	}
> > +
> >  static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
> >  {
> >  	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
> > @@ -2169,189 +2194,247 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> >  	vms_complete_munmap_vmas(vms, mas_detach);
> >  }
> >
> > -unsigned long __mmap_region(struct file *file, unsigned long addr,
> > -		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > -		struct list_head *uf)
> > +/*
> > + * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
> > + *                    unmapped once the map operation is completed, check limits,
> > + *                    account mapping and clean up any pre-existing VMAs.
> > + *
> > + * @map: Mapping state.
> > + *
> > + * Returns: 0 on success, error code otherwise.
> > + */
> > +static int __mmap_prepare(struct mmap_state *map)
> >  {
> > -	struct mm_struct *mm = current->mm;
> > -	struct vm_area_struct *vma = NULL;
> > -	pgoff_t pglen = PHYS_PFN(len);
> > -	unsigned long charged = 0;
> > -	struct vma_munmap_struct vms;
> > -	struct ma_state mas_detach;
> > -	struct maple_tree mt_detach;
> > -	unsigned long end = addr + len;
> >  	int error;
> > -	VMA_ITERATOR(vmi, mm, addr);
> > -	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > -
> > -	vmg.file = file;
> > -	/* Find the first overlapping VMA */
> > -	vma = vma_find(&vmi, end);
> > -	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > -	if (vma) {
> > -		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > -		mt_on_stack(mt_detach);
> > -		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	struct vma_munmap_struct *vms = &map->vms;
> > +
> > +	/* Find the first overlapping VMA and initialise unmap state. */
> > +	vms->vma = vma_find(vmi, vmg->end);
> > +	init_vma_munmap(vms, vmi, vms->vma, vmg->start, vmg->end, map->uf,
> > +			/* unlock = */ false);
> > +
> > +	/* OK, we have overlapping VMAs - prepare to unmap them. */
> > +	if (vms->vma) {
> > +		mt_init_flags(&map->mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > +		mt_on_stack(map->mt_detach);
> > +		mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
> >  		/* Prepare to unmap any existing mapping in the area */
> > -		error = vms_gather_munmap_vmas(&vms, &mas_detach);
> > +		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
> >  		if (error)
> > -			goto gather_failed;
> > +			return error;
>
> So this assumes vms_abort_munmap_vmas() will rely on the "vms->nr_pages will
> be equal to zero" mentioned in commit log. But AFAICS
> vms_gather_munmap_vmas() can fail in Nth iteration of its
> for_each_vma_range() after some iterations already increased nr_pages and it
> will do a reattach_vmas() and return the error and we just return the error
> here.
> I think either here or maybe in vms_gather_munmap_vmas() itself a reset of
> vms->nr_pages to zero on error should happen for the vms_abort_munmap_vmas()
> to be a no-op?

Ugh yup, I had wrongly assumed this would not be the case, but there we go;
it rather makes the point of what this whole series is about... will fix.

>
> >
> > -		vmg.next = vms.next;
> > -		vmg.prev = vms.prev;
> > -		vma = NULL;
> > +		vmg->next = vms->next;
> > +		vmg->prev = vms->prev;
> >  	} else {
> > -		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
> > +		vmg->next = vma_iter_next_rewind(vmi, &vmg->prev);
> >  	}
> >
> >  	/* Check against address space limit. */
> > -	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
> > -		error = -ENOMEM;
> > -		goto abort_munmap;
> > -	}
> > +	if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
> > +		return -ENOMEM;
> >
> > -	/*
> > -	 * Private writable mapping: check memory availability
> > -	 */
> > -	if (accountable_mapping(file, vm_flags)) {
> > -		charged = pglen;
> > -		charged -= vms.nr_accounted;
> > -		if (charged) {
> > -			error = security_vm_enough_memory_mm(mm, charged);
> > +	/* Private writable mapping: check memory availability. */
> > +	if (accountable_mapping(vmg->file, map->flags)) {
> > +		map->charged = map->pglen;
> > +		map->charged -= vms->nr_accounted;
> > +		if (map->charged) {
> > +			error = security_vm_enough_memory_mm(map->mm, map->charged);
> >  			if (error)
> > -				goto abort_munmap;
> > +				return error;
> >  		}
> >
> > -		vms.nr_accounted = 0;
> > -		vm_flags |= VM_ACCOUNT;
> > -		vmg.flags = vm_flags;
> > +		vms->nr_accounted = 0;
> > +		map->flags |= VM_ACCOUNT;
> >  	}
> >
> >  	/*
> > -	 * clear PTEs while the vma is still in the tree so that rmap
> > +	 * Clear PTEs while the vma is still in the tree so that rmap
> >  	 * cannot race with the freeing later in the truncate scenario.
> >  	 * This is also needed for mmap_file(), which is why vm_ops
> >  	 * close function is called.
> >  	 */
> > -	vms_clean_up_area(&vms, &mas_detach);
> > -	vma = vma_merge_new_range(&vmg);
> > -	if (vma)
> > -		goto expanded;
> > +	vms_clean_up_area(vms, &map->mas_detach);
> > +
> > +	return 0;
> > +}
> > +
> > +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> > +			       struct vm_area_struct **mergep)
> > +{
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	int error;
> > +
> > +	vma->vm_file = get_file(vmg->file);
> > +	error = mmap_file(vma->vm_file, vma);
> > +	if (error) {
> > +		fput(vma->vm_file);
> > +		vma->vm_file = NULL;
> > +
> > +		vma_iter_set(vmi, vma->vm_end);
> > +		/* Undo any partial mapping done by a device driver. */
> > +		unmap_region(&vmi->mas, vma, vmg->prev, vmg->next);
> > +
> > +		return error;
> > +	}
> > +
> > +	/* Drivers cannot alter the address of the VMA. */
> > +	WARN_ON_ONCE(vmg->start != vma->vm_start);
> > +	/*
> > +	 * Drivers should not permit writability when previously it was
> > +	 * disallowed.
> > +	 */
> > +	VM_WARN_ON_ONCE(map->flags != vma->vm_flags &&
> > +			!(map->flags & VM_MAYWRITE) &&
> > +			(vma->vm_flags & VM_MAYWRITE));
> > +
> > +	vma_iter_config(vmi, vmg->start, vmg->end);
> > +	/*
> > +	 * If flags changed after mmap_file(), we should try merge
> > +	 * vma again as we may succeed this time.
> > +	 */
> > +	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> > +		struct vm_area_struct *merge;
> > +
> > +		vmg->flags = vma->vm_flags;
> > +		/* If this fails, state is reset ready for a reattempt. */
> > +		merge = vma_merge_new_range(vmg);
> > +
> > +		if (merge) {
> > +			/*
> > +			 * ->mmap() can change vma->vm_file and fput
> > +			 * the original file. So fput the vma->vm_file
> > +			 * here or we would add an extra fput for file
> > +			 * and cause general protection fault
> > +			 * ultimately.
> > +			 */
> > +			fput(vma->vm_file);
> > +			vm_area_free(vma);
>
> This frees the vma.
>
> > +			vma_iter_free(vmi);
> > +			*mergep = merge;
> > +		} else {
> > +			vma_iter_config(vmi, vmg->start, vmg->end);
> > +		}
> > +	}
> > +
> > +	map->flags = vma->vm_flags;
>
> So this is use-after-free.
>
> Maybe pass only a single struct vm_area_struct **vmap parameter to this
> function, and in case of merge, change both vma and *vmap to it?.
>
> Although I can see it's all moot after 8/8. Still let's not introduce a
> temporary UAF step.

Even more vom. Will fix. The irony is I 'fixed' this code to something
'neater' by referencing vma here, rather stupidly.

You can tell it's a rarely used path since I ran a full suite of tests and
didn't hit it.

Again, speaks to the point of this series in general...
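For reference, I read the suggestion as roughly the following shape (a
sketch only: a single **vmap parameter instead of vma plus mergep, with
vma = *vmap locally; exact form to be confirmed in the respin):

		if (merge) {
			/*
			 * ->mmap() can change vma->vm_file and fput the
			 * original file, so fput vma->vm_file here to avoid
			 * an extra fput later.
			 */
			fput(vma->vm_file);
			vm_area_free(vma);
			vma_iter_free(vmi);
			/* Point both local and caller's pointers at the merged VMA. */
			vma = merge;
			*vmap = merge;
		} else {
			vma_iter_config(vmi, vmg->start, vmg->end);
		}
	}

	/* vma now refers to the merged VMA if one was created, so no UAF here. */
	map->flags = vma->vm_flags;
	return 0;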

>
> > +	return 0;
> > +}
> > +
> > +/*
> > + * __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
> > + *                    possible.
> > + *
> > + *                    An exception to this is if the mapping is file-backed, and
> > + *                    the underlying driver changes the VMA flags, permitting a
> > + *                    subsequent merge of the VMA, in which case the returned
> > + *                    VMA is one that was merged on a second attempt.
> > + *
> > + * @map:  Mapping state.
> > + * @vmap: Output pointer for the new VMA.
> > + *
> > + * Returns: Zero on success, or an error.
> > + */
> > +static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > +{
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	struct vm_area_struct *merge = NULL;
> > +	int error = 0;
> > +	struct vm_area_struct *vma;
> > +
> >  	/*
> >  	 * Determine the object being mapped and call the appropriate
> >  	 * specific mapper. the address has already been validated, but
> >  	 * not unmapped, but the maps are removed from the list.
> >  	 */
> > -	vma = vm_area_alloc(mm);
> > -	if (!vma) {
> > -		error = -ENOMEM;
> > -		goto unacct_error;
> > -	}
> > +	vma = vm_area_alloc(map->mm);
> > +	if (!vma)
> > +		return -ENOMEM;
> >
> > -	vma_iter_config(&vmi, addr, end);
> > -	vma_set_range(vma, addr, end, pgoff);
> > -	vm_flags_init(vma, vm_flags);
> > -	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > +	vma_iter_config(vmi, vmg->start, vmg->end);
> > +	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
> > +	vm_flags_init(vma, map->flags);
> > +	vma->vm_page_prot = vm_get_page_prot(map->flags);
> >
> > -	if (vma_iter_prealloc(&vmi, vma)) {
> > +	if (vma_iter_prealloc(vmi, vma)) {
> >  		error = -ENOMEM;
> >  		goto free_vma;
> >  	}
> >
> > -	if (file) {
> > -		vma->vm_file = get_file(file);
> > -		error = mmap_file(file, vma);
> > -		if (error)
> > -			goto unmap_and_free_file_vma;
> > -
> > -		/* Drivers cannot alter the address of the VMA. */
> > -		WARN_ON_ONCE(addr != vma->vm_start);
> > -		/*
> > -		 * Drivers should not permit writability when previously it was
> > -		 * disallowed.
> > -		 */
> > -		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> > -				!(vm_flags & VM_MAYWRITE) &&
> > -				(vma->vm_flags & VM_MAYWRITE));
> > -
> > -		vma_iter_config(&vmi, addr, end);
> > -		/*
> > -		 * If vm_flags changed after mmap_file(), we should try merge
> > -		 * vma again as we may succeed this time.
> > -		 */
> > -		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> > -			struct vm_area_struct *merge;
> > -
> > -			vmg.flags = vma->vm_flags;
> > -			/* If this fails, state is reset ready for a reattempt. */
> > -			merge = vma_merge_new_range(&vmg);
> > -
> > -			if (merge) {
> > -				/*
> > -				 * ->mmap() can change vma->vm_file and fput
> > -				 * the original file. So fput the vma->vm_file
> > -				 * here or we would add an extra fput for file
> > -				 * and cause general protection fault
> > -				 * ultimately.
> > -				 */
> > -				fput(vma->vm_file);
> > -				vm_area_free(vma);
> > -				vma_iter_free(&vmi);
> > -				vma = merge;
> > -				/* Update vm_flags to pick up the change. */
> > -				vm_flags = vma->vm_flags;
> > -				goto file_expanded;
> > -			}
> > -			vma_iter_config(&vmi, addr, end);
> > -		}
> > -
> > -		vm_flags = vma->vm_flags;
> > -	} else if (vm_flags & VM_SHARED) {
> > +	if (vmg->file)
> > +		error = __mmap_new_file_vma(map, vma, &merge);
> > +	else if (map->flags & VM_SHARED)
> >  		error = shmem_zero_setup(vma);
> > -		if (error)
> > -			goto free_iter_vma;
> > -	} else {
> > +	else
> >  		vma_set_anonymous(vma);
> > -	}
> > +
> > +	if (error)
> > +		goto free_iter_vma;
> > +
> > +	if (merge)
> > +		goto file_expanded;
> >
> >  #ifdef CONFIG_SPARC64
> >  	/* TODO: Fix SPARC ADI! */
> > -	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> > +	WARN_ON_ONCE(!arch_validate_flags(map->flags));
> >  #endif
> >
> >  	/* Lock the VMA since it is modified after insertion into VMA tree */
> >  	vma_start_write(vma);
> > -	vma_iter_store(&vmi, vma);
> > -	mm->map_count++;
> > +	vma_iter_store(vmi, vma);
> > +	map->mm->map_count++;
> >  	vma_link_file(vma);
> >
> >  	/*
> >  	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
> >  	 * call covers the non-merge case.
> >  	 */
> > -	khugepaged_enter_vma(vma, vma->vm_flags);
> > +	khugepaged_enter_vma(vma, map->flags);
> >
> >  file_expanded:
> > -	file = vma->vm_file;
> >  	ksm_add_vma(vma);
> > -expanded:
> > +
> > +	*vmap = vma;
> > +	return 0;
> > +
> > +free_iter_vma:
> > +	vma_iter_free(vmi);
> > +free_vma:
> > +	vm_area_free(vma);
> > +	return error;
> > +}
> > +
> > +/*
> > + * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
> > + *                     statistics, handle locking and finalise the VMA.
> > + *
> > + * @map: Mapping state.
> > + * @vma: Merged or newly allocated VMA for the mmap()'d region.
> > + */
> > +static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> > +{
> > +	struct mm_struct *mm = map->mm;
> > +	unsigned long vm_flags = vma->vm_flags;
> > +
> >  	perf_event_mmap(vma);
> >
> > -	/* Unmap any existing mapping in the area */
> > -	vms_complete_munmap_vmas(&vms, &mas_detach);
> > +	/* Unmap any existing mapping in the area. */
> > +	vms_complete_munmap_vmas(&map->vms, &map->mas_detach);
> >
> > -	vm_stat_account(mm, vm_flags, pglen);
> > +	vm_stat_account(mm, vma->vm_flags, map->pglen);
> >  	if (vm_flags & VM_LOCKED) {
> >  		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> >  					is_vm_hugetlb_page(vma) ||
> > -					vma == get_gate_vma(current->mm))
> > +					vma == get_gate_vma(mm))
> >  			vm_flags_clear(vma, VM_LOCKED_MASK);
> >  		else
> > -			mm->locked_vm += pglen;
> > +			mm->locked_vm += map->pglen;
> >  	}
> >
> > -	if (file)
> > +	if (vma->vm_file)
> >  		uprobe_mmap(vma);
> >
> >  	/*
> > @@ -2364,26 +2447,43 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
> >  	vm_flags_set(vma, VM_SOFTDIRTY);
> >
> >  	vma_set_page_prot(vma);
> > +}
> >
> > -	return addr;
> > +unsigned long __mmap_region(struct file *file, unsigned long addr,
> > +		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > +		struct list_head *uf)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	struct vm_area_struct *vma;
> > +	int error;
> > +	VMA_ITERATOR(vmi, mm, addr);
> > +	VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, pgoff);
> > +	MMAP_STATE(map, mm, &vmi, &vmg, uf, vm_flags, len);
> >
> > -unmap_and_free_file_vma:
> > -	fput(vma->vm_file);
> > -	vma->vm_file = NULL;
> > +	vmg.file = file;
> >
> > -	vma_iter_set(&vmi, vma->vm_end);
> > -	/* Undo any partial mapping done by a device driver. */
> > -	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > -free_iter_vma:
> > -	vma_iter_free(&vmi);
> > -free_vma:
> > -	vm_area_free(vma);
> > -unacct_error:
> > -	if (charged)
> > -		vm_unacct_memory(charged);
> > +	error = __mmap_prepare(&map);
> > +	if (error)
> > +		goto abort_munmap;
> > +
> > +	/* Attempt to merge with adjacent VMAs... */
> > +	vmg.flags = map.flags;
> > +	vma = vma_merge_new_range(&vmg);
> > +	if (!vma) {
> > +		/* ...but if we can't, allocate a new VMA. */
> > +		error = __mmap_new_vma(&map, &vma);
> > +		if (error)
> > +			goto unacct_error;
> > +	}
> > +
> > +	__mmap_complete(&map, vma);
> >
> > +	return addr;
> > +
> > +unacct_error:
> > +	if (map.charged)
> > +		vm_unacct_memory(map.charged);
> >  abort_munmap:
> > -	vms_abort_munmap_vmas(&vms, &mas_detach);
> > -gather_failed:
> > +	vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
> >  	return error;
> >  }
> > --
> > 2.47.0
>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region()
  2024-10-23 15:21     ` Liam R. Howlett
@ 2024-10-23 17:39       ` Lorenzo Stoakes
  2024-10-23 18:12         ` Liam R. Howlett
  0 siblings, 1 reply; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 17:39 UTC (permalink / raw)
  To: Liam R. Howlett, Vlastimil Babka, Andrew Morton, Jann Horn,
	linux-kernel, linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 11:21:54AM -0400, Liam R. Howlett wrote:
> * Vlastimil Babka <vbabka@suse.cz> [241023 10:39]:
> > On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > > We have seen bugs and resource leaks arise from the complexity of the
> > > __mmap_region() function. This, and the generally deeply fragile error
> > > handling logic and complexity which makes understanding the function
> > > difficult make it highly desirable to refactor it into something readable.
> > >
> > > Achieve this by separating the function into smaller logical parts which
> > > are easier to understand and follow, and which importantly very
> > > significantly simplify the error handling.
> > >
> > > Note that we now call vms_abort_munmap_vmas() in more error paths than we
> > > used to, however in cases where no abort need occur, vms->nr_pages will be
> > > equal to zero and we simply exit this function without doing more than we
> > > would have done previously.
> > >
> > > Importantly, the invocation of the driver mmap hook via mmap_file() now has
> > > very simple and obvious handling (this was previously the most problematic
> > > part of the mmap() operation).
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > ---
> > >  mm/vma.c | 380 +++++++++++++++++++++++++++++++++++--------------------
> > >  1 file changed, 240 insertions(+), 140 deletions(-)
> > >
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index 7617f9d50d62..a271e2b406ab 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -7,6 +7,31 @@
> > >  #include "vma_internal.h"
> > >  #include "vma.h"
> > >
> > > +struct mmap_state {
> > > +	struct mm_struct *mm;
> > > +	struct vma_iterator *vmi;
> > > +	struct vma_merge_struct *vmg;
> > > +	struct list_head *uf;
> > > +
> > > +	struct vma_munmap_struct vms;
> > > +	struct ma_state mas_detach;
> > > +	struct maple_tree mt_detach;
> > > +
> > > +	unsigned long flags;
> > > +	unsigned long pglen;
> > > +	unsigned long charged;
> > > +};
> > > +
> > > +#define MMAP_STATE(name, mm_, vmi_, vmg_, uf_, flags_, len_)	\
> > > +	struct mmap_state name = {				\
> > > +		.mm = mm_,					\
> > > +		.vmi = vmi_,					\
> > > +		.vmg = vmg_,					\
> > > +		.uf = uf_,					\
> > > +		.flags = flags_,				\
> > > +		.pglen = PHYS_PFN(len_),			\
> > > +	}
> > > +
> > >  static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
> > >  {
> > >  	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
> > > @@ -2169,189 +2194,247 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> > >  	vms_complete_munmap_vmas(vms, mas_detach);
> > >  }
> > >
> > > -unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > -		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > > -		struct list_head *uf)
> > > +/*
> > > + * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
> > > + *                    unmapped once the map operation is completed, check limits,
> > > + *                    account mapping and clean up any pre-existing VMAs.
> > > + *
> > > + * @map: Mapping state.
> > > + *
> > > + * Returns: 0 on success, error code otherwise.
> > > + */
> > > +static int __mmap_prepare(struct mmap_state *map)
> > >  {
> > > -	struct mm_struct *mm = current->mm;
> > > -	struct vm_area_struct *vma = NULL;
> > > -	pgoff_t pglen = PHYS_PFN(len);
> > > -	unsigned long charged = 0;
> > > -	struct vma_munmap_struct vms;
> > > -	struct ma_state mas_detach;
> > > -	struct maple_tree mt_detach;
> > > -	unsigned long end = addr + len;
> > >  	int error;
> > > -	VMA_ITERATOR(vmi, mm, addr);
> > > -	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > > -
> > > -	vmg.file = file;
> > > -	/* Find the first overlapping VMA */
> > > -	vma = vma_find(&vmi, end);
> > > -	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > > -	if (vma) {
> > > -		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > > -		mt_on_stack(mt_detach);
> > > -		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > > +	struct vma_iterator *vmi = map->vmi;
> > > +	struct vma_merge_struct *vmg = map->vmg;
> > > +	struct vma_munmap_struct *vms = &map->vms;
> > > +
> > > +	/* Find the first overlapping VMA and initialise unmap state. */
> > > +	vms->vma = vma_find(vmi, vmg->end);
> > > +	init_vma_munmap(vms, vmi, vms->vma, vmg->start, vmg->end, map->uf,
> > > +			/* unlock = */ false);
> > > +
> > > +	/* OK, we have overlapping VMAs - prepare to unmap them. */
> > > +	if (vms->vma) {
> > > +		mt_init_flags(&map->mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > > +		mt_on_stack(map->mt_detach);
> > > +		mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
> > >  		/* Prepare to unmap any existing mapping in the area */
> > > -		error = vms_gather_munmap_vmas(&vms, &mas_detach);
> > > +		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
> > >  		if (error)
> > > -			goto gather_failed;
> > > +			return error;
> >
> > So this assumes vms_abort_munmap_vmas() will rely on the "vms->nr_pages will
> > be equal to zero" mentioned in commit log. But AFAICS
> > vms_gather_munmap_vmas() can fail in Nth iteration of its
> > for_each_vma_range() after some iterations already increased nr_pages and it
> > will do a reattach_vmas() and return the error and we just return the error
> > here.
> > I think either here or maybe in vms_gather_munmap_vmas() itself a reset of
> > vms->nr_pages to zero on error should happen for the vms_abort_munmap_vmas()
> > to be a no-op?
>
> Probably in reattach_vmas()?

Hm, but that only accepts a mas and seems redundant elsewhere... am going for
simply resetting nr_pages for now and maybe we can revisit if needs be?
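I.e. something along these lines in __mmap_prepare() (a sketch; it could
equally live in vms_gather_munmap_vmas() itself):

		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
		if (error) {
			/*
			 * The gather failed part-way and reattach_vmas() has
			 * already run, so reset nr_pages to ensure the later
			 * vms_abort_munmap_vmas() in __mmap_region() is a
			 * no-op.
			 */
			vms->nr_pages = 0;
			return error;
		}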

>
> >
> > >
> > > -		vmg.next = vms.next;
> > > -		vmg.prev = vms.prev;
> > > -		vma = NULL;
> > > +		vmg->next = vms->next;
> > > +		vmg->prev = vms->prev;
> > >  	} else {
> > > -		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
> > > +		vmg->next = vma_iter_next_rewind(vmi, &vmg->prev);
> > >  	}
> > >
> > >  	/* Check against address space limit. */
> > > -	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
> > > -		error = -ENOMEM;
> > > -		goto abort_munmap;
> > > -	}
> > > +	if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
> > > +		return -ENOMEM;
> > >
> > > -	/*
> > > -	 * Private writable mapping: check memory availability
> > > -	 */
> > > -	if (accountable_mapping(file, vm_flags)) {
> > > -		charged = pglen;
> > > -		charged -= vms.nr_accounted;
> > > -		if (charged) {
> > > -			error = security_vm_enough_memory_mm(mm, charged);
> > > +	/* Private writable mapping: check memory availability. */
> > > +	if (accountable_mapping(vmg->file, map->flags)) {
> > > +		map->charged = map->pglen;
> > > +		map->charged -= vms->nr_accounted;
> > > +		if (map->charged) {
> > > +			error = security_vm_enough_memory_mm(map->mm, map->charged);
> > >  			if (error)
> > > -				goto abort_munmap;
> > > +				return error;
> > >  		}
> > >
> > > -		vms.nr_accounted = 0;
> > > -		vm_flags |= VM_ACCOUNT;
> > > -		vmg.flags = vm_flags;
> > > +		vms->nr_accounted = 0;
> > > +		map->flags |= VM_ACCOUNT;
> > >  	}
> > >
> > >  	/*
> > > -	 * clear PTEs while the vma is still in the tree so that rmap
> > > +	 * Clear PTEs while the vma is still in the tree so that rmap
> > >  	 * cannot race with the freeing later in the truncate scenario.
> > >  	 * This is also needed for mmap_file(), which is why vm_ops
> > >  	 * close function is called.
> > >  	 */
> > > -	vms_clean_up_area(&vms, &mas_detach);
> > > -	vma = vma_merge_new_range(&vmg);
> > > -	if (vma)
> > > -		goto expanded;
> > > +	vms_clean_up_area(vms, &map->mas_detach);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> > > +			       struct vm_area_struct **mergep)
> > > +{
> > > +	struct vma_iterator *vmi = map->vmi;
> > > +	struct vma_merge_struct *vmg = map->vmg;
> > > +	int error;
> > > +
> > > +	vma->vm_file = get_file(vmg->file);
> > > +	error = mmap_file(vma->vm_file, vma);
> > > +	if (error) {
> > > +		fput(vma->vm_file);
> > > +		vma->vm_file = NULL;
> > > +
> > > +		vma_iter_set(vmi, vma->vm_end);
> > > +		/* Undo any partial mapping done by a device driver. */
> > > +		unmap_region(&vmi->mas, vma, vmg->prev, vmg->next);
> > > +
> > > +		return error;
> > > +	}
> > > +
> > > +	/* Drivers cannot alter the address of the VMA. */
> > > +	WARN_ON_ONCE(vmg->start != vma->vm_start);
> > > +	/*
> > > +	 * Drivers should not permit writability when previously it was
> > > +	 * disallowed.
> > > +	 */
> > > +	VM_WARN_ON_ONCE(map->flags != vma->vm_flags &&
> > > +			!(map->flags & VM_MAYWRITE) &&
> > > +			(vma->vm_flags & VM_MAYWRITE));
> > > +
> > > +	vma_iter_config(vmi, vmg->start, vmg->end);
> > > +	/*
> > > +	 * If flags changed after mmap_file(), we should try merge
> > > +	 * vma again as we may succeed this time.
> > > +	 */
> > > +	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> > > +		struct vm_area_struct *merge;
> > > +
> > > +		vmg->flags = vma->vm_flags;
> > > +		/* If this fails, state is reset ready for a reattempt. */
> > > +		merge = vma_merge_new_range(vmg);
> > > +
> > > +		if (merge) {
> > > +			/*
> > > +			 * ->mmap() can change vma->vm_file and fput
> > > +			 * the original file. So fput the vma->vm_file
> > > +			 * here or we would add an extra fput for file
> > > +			 * and cause general protection fault
> > > +			 * ultimately.
> > > +			 */
> > > +			fput(vma->vm_file);
> > > +			vm_area_free(vma);
> >
> > This frees the vma.
> >
> > > +			vma_iter_free(vmi);
> > > +			*mergep = merge;
> > > +		} else {
> > > +			vma_iter_config(vmi, vmg->start, vmg->end);
> > > +		}
> > > +	}
> > > +
> > > +	map->flags = vma->vm_flags;
> >
> > So this is use-after-free.
> >
> > Maybe pass only a single struct vm_area_struct **vmap parameter to this
> > function, and in case of merge, change both vma and *vmap to it?.
> >
> > Although I can see it's all moot after 8/8. Still let's not introduce a
> > temporary UAF step.
> >
> > > +	return 0;
> > > +}
> > > +
> > > +/*
> > > + * __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
> > > + *                    possible.
> > > + *
> > > + *                    An exception to this is if the mapping is file-backed, and
> > > + *                    the underlying driver changes the VMA flags, permitting a
> > > + *                    subsequent merge of the VMA, in which case the returned
> > > + *                    VMA is one that was merged on a second attempt.
> > > + *
> > > + * @map:  Mapping state.
> > > + * @vmap: Output pointer for the new VMA.
> > > + *
> > > + * Returns: Zero on success, or an error.
> > > + */
> > > +static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > > +{
> > > +	struct vma_iterator *vmi = map->vmi;
> > > +	struct vma_merge_struct *vmg = map->vmg;
> > > +	struct vm_area_struct *merge = NULL;
> > > +	int error = 0;
> > > +	struct vm_area_struct *vma;
> > > +
> > >  	/*
> > >  	 * Determine the object being mapped and call the appropriate
> > >  	 * specific mapper. the address has already been validated, but
> > >  	 * not unmapped, but the maps are removed from the list.
> > >  	 */
> > > -	vma = vm_area_alloc(mm);
> > > -	if (!vma) {
> > > -		error = -ENOMEM;
> > > -		goto unacct_error;
> > > -	}
> > > +	vma = vm_area_alloc(map->mm);
> > > +	if (!vma)
> > > +		return -ENOMEM;
> > >
> > > -	vma_iter_config(&vmi, addr, end);
> > > -	vma_set_range(vma, addr, end, pgoff);
> > > -	vm_flags_init(vma, vm_flags);
> > > -	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > > +	vma_iter_config(vmi, vmg->start, vmg->end);
> > > +	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
> > > +	vm_flags_init(vma, map->flags);
> > > +	vma->vm_page_prot = vm_get_page_prot(map->flags);
> > >
> > > -	if (vma_iter_prealloc(&vmi, vma)) {
> > > +	if (vma_iter_prealloc(vmi, vma)) {
> > >  		error = -ENOMEM;
> > >  		goto free_vma;
> > >  	}
> > >
> > > -	if (file) {
> > > -		vma->vm_file = get_file(file);
> > > -		error = mmap_file(file, vma);
> > > -		if (error)
> > > -			goto unmap_and_free_file_vma;
> > > -
> > > -		/* Drivers cannot alter the address of the VMA. */
> > > -		WARN_ON_ONCE(addr != vma->vm_start);
> > > -		/*
> > > -		 * Drivers should not permit writability when previously it was
> > > -		 * disallowed.
> > > -		 */
> > > -		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> > > -				!(vm_flags & VM_MAYWRITE) &&
> > > -				(vma->vm_flags & VM_MAYWRITE));
> > > -
> > > -		vma_iter_config(&vmi, addr, end);
> > > -		/*
> > > -		 * If vm_flags changed after mmap_file(), we should try merge
> > > -		 * vma again as we may succeed this time.
> > > -		 */
> > > -		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> > > -			struct vm_area_struct *merge;
> > > -
> > > -			vmg.flags = vma->vm_flags;
> > > -			/* If this fails, state is reset ready for a reattempt. */
> > > -			merge = vma_merge_new_range(&vmg);
> > > -
> > > -			if (merge) {
> > > -				/*
> > > -				 * ->mmap() can change vma->vm_file and fput
> > > -				 * the original file. So fput the vma->vm_file
> > > -				 * here or we would add an extra fput for file
> > > -				 * and cause general protection fault
> > > -				 * ultimately.
> > > -				 */
> > > -				fput(vma->vm_file);
> > > -				vm_area_free(vma);
> > > -				vma_iter_free(&vmi);
> > > -				vma = merge;
> > > -				/* Update vm_flags to pick up the change. */
> > > -				vm_flags = vma->vm_flags;
> > > -				goto file_expanded;
> > > -			}
> > > -			vma_iter_config(&vmi, addr, end);
> > > -		}
> > > -
> > > -		vm_flags = vma->vm_flags;
> > > -	} else if (vm_flags & VM_SHARED) {
> > > +	if (vmg->file)
> > > +		error = __mmap_new_file_vma(map, vma, &merge);
> > > +	else if (map->flags & VM_SHARED)
> > >  		error = shmem_zero_setup(vma);
> > > -		if (error)
> > > -			goto free_iter_vma;
> > > -	} else {
> > > +	else
> > >  		vma_set_anonymous(vma);
> > > -	}
> > > +
> > > +	if (error)
> > > +		goto free_iter_vma;
> > > +
> > > +	if (merge)
> > > +		goto file_expanded;
> > >
> > >  #ifdef CONFIG_SPARC64
> > >  	/* TODO: Fix SPARC ADI! */
> > > -	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> > > +	WARN_ON_ONCE(!arch_validate_flags(map->flags));
> > >  #endif
> > >
> > >  	/* Lock the VMA since it is modified after insertion into VMA tree */
> > >  	vma_start_write(vma);
> > > -	vma_iter_store(&vmi, vma);
> > > -	mm->map_count++;
> > > +	vma_iter_store(vmi, vma);
> > > +	map->mm->map_count++;
> > >  	vma_link_file(vma);
> > >
> > >  	/*
> > >  	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
> > >  	 * call covers the non-merge case.
> > >  	 */
> > > -	khugepaged_enter_vma(vma, vma->vm_flags);
> > > +	khugepaged_enter_vma(vma, map->flags);
> > >
> > >  file_expanded:
> > > -	file = vma->vm_file;
> > >  	ksm_add_vma(vma);
> > > -expanded:
> > > +
> > > +	*vmap = vma;
> > > +	return 0;
> > > +
> > > +free_iter_vma:
> > > +	vma_iter_free(vmi);
> > > +free_vma:
> > > +	vm_area_free(vma);
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
> > > + *                     statistics, handle locking and finalise the VMA.
> > > + *
> > > + * @map: Mapping state.
> > > + * @vma: Merged or newly allocated VMA for the mmap()'d region.
> > > + */
> > > +static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> > > +{
> > > +	struct mm_struct *mm = map->mm;
> > > +	unsigned long vm_flags = vma->vm_flags;
> > > +
> > >  	perf_event_mmap(vma);
> > >
> > > -	/* Unmap any existing mapping in the area */
> > > -	vms_complete_munmap_vmas(&vms, &mas_detach);
> > > +	/* Unmap any existing mapping in the area. */
> > > +	vms_complete_munmap_vmas(&map->vms, &map->mas_detach);
> > >
> > > -	vm_stat_account(mm, vm_flags, pglen);
> > > +	vm_stat_account(mm, vma->vm_flags, map->pglen);
> > >  	if (vm_flags & VM_LOCKED) {
> > >  		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > >  					is_vm_hugetlb_page(vma) ||
> > > -					vma == get_gate_vma(current->mm))
> > > +					vma == get_gate_vma(mm))
> > >  			vm_flags_clear(vma, VM_LOCKED_MASK);
> > >  		else
> > > -			mm->locked_vm += pglen;
> > > +			mm->locked_vm += map->pglen;
> > >  	}
> > >
> > > -	if (file)
> > > +	if (vma->vm_file)
> > >  		uprobe_mmap(vma);
> > >
> > >  	/*
> > > @@ -2364,26 +2447,43 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
> > >  	vm_flags_set(vma, VM_SOFTDIRTY);
> > >
> > >  	vma_set_page_prot(vma);
> > > +}
> > >
> > > -	return addr;
> > > +unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > +		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > > +		struct list_head *uf)
> > > +{
> > > +	struct mm_struct *mm = current->mm;
> > > +	struct vm_area_struct *vma;
> > > +	int error;
> > > +	VMA_ITERATOR(vmi, mm, addr);
> > > +	VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, pgoff);
> > > +	MMAP_STATE(map, mm, &vmi, &vmg, uf, vm_flags, len);
> > >
> > > -unmap_and_free_file_vma:
> > > -	fput(vma->vm_file);
> > > -	vma->vm_file = NULL;
> > > +	vmg.file = file;
> > >
> > > -	vma_iter_set(&vmi, vma->vm_end);
> > > -	/* Undo any partial mapping done by a device driver. */
> > > -	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > > -free_iter_vma:
> > > -	vma_iter_free(&vmi);
> > > -free_vma:
> > > -	vm_area_free(vma);
> > > -unacct_error:
> > > -	if (charged)
> > > -		vm_unacct_memory(charged);
> > > +	error = __mmap_prepare(&map);
> > > +	if (error)
> > > +		goto abort_munmap;
> > > +
> > > +	/* Attempt to merge with adjacent VMAs... */
> > > +	vmg.flags = map.flags;
> > > +	vma = vma_merge_new_range(&vmg);
> > > +	if (!vma) {
> > > +		/* ...but if we can't, allocate a new VMA. */
> > > +		error = __mmap_new_vma(&map, &vma);
> > > +		if (error)
> > > +			goto unacct_error;
> > > +	}
> > > +
> > > +	__mmap_complete(&map, vma);
> > >
> > > +	return addr;
> > > +
> > > +unacct_error:
> > > +	if (map.charged)
> > > +		vm_unacct_memory(map.charged);
> > >  abort_munmap:
> > > -	vms_abort_munmap_vmas(&vms, &mas_detach);
> > > -gather_failed:
> > > +	vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
> > >  	return error;
> > >  }
> > > --
> > > 2.47.0
> >


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region()
  2024-10-23 17:19   ` Liam R. Howlett
@ 2024-10-23 17:52     ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 17:52 UTC (permalink / raw)
  To: Liam R. Howlett, Andrew Morton, Vlastimil Babka, Jann Horn,
	linux-kernel, linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 01:19:35PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241022 16:41]:
> > We have seen bugs and resource leaks arise from the complexity of the
> > __mmap_region() function. This, and the generally deeply fragile error
> > handling logic and complexity which makes understanding the function
> > difficult make it highly desirable to refactor it into something readable.
> >
> > Achieve this by separating the function into smaller logical parts which
> > are easier to understand and follow, and which importantly very
> > significantly simplify the error handling.
> >
> > Note that we now call vms_abort_munmap_vmas() in more error paths than we
> > used to, however in cases where no abort need occur, vms->nr_pages will be
> > equal to zero and we simply exit this function without doing more than we
> > would have done previously.
> >
> > Importantly, the invocation of the driver mmap hook via mmap_file() now has
> > very simple and obvious handling (this was previously the most problematic
> > part of the mmap() operation).
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  mm/vma.c | 380 +++++++++++++++++++++++++++++++++++--------------------
> >  1 file changed, 240 insertions(+), 140 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 7617f9d50d62..a271e2b406ab 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -7,6 +7,31 @@
> >  #include "vma_internal.h"
> >  #include "vma.h"
> >
> > +struct mmap_state {
> > +	struct mm_struct *mm;
> > +	struct vma_iterator *vmi;
> > +	struct vma_merge_struct *vmg;
> > +	struct list_head *uf;
> > +
> > +	struct vma_munmap_struct vms;
> > +	struct ma_state mas_detach;
> > +	struct maple_tree mt_detach;
> > +
> > +	unsigned long flags;
> > +	unsigned long pglen;
> > +	unsigned long charged;
> > +};
> > +
> > +#define MMAP_STATE(name, mm_, vmi_, vmg_, uf_, flags_, len_)	\
> > +	struct mmap_state name = {				\
> > +		.mm = mm_,					\
> > +		.vmi = vmi_,					\
> > +		.vmg = vmg_,					\
> > +		.uf = uf_,					\
> > +		.flags = flags_,				\
> > +		.pglen = PHYS_PFN(len_),			\
> > +	}
> > +
> >  static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
> >  {
> >  	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
> > @@ -2169,189 +2194,247 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> >  	vms_complete_munmap_vmas(vms, mas_detach);
> >  }
> >
> > -unsigned long __mmap_region(struct file *file, unsigned long addr,
> > -		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > -		struct list_head *uf)
> > +/*
> > + * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
> > + *                    unmapped once the map operation is completed, check limits,
> > + *                    account mapping and clean up any pre-existing VMAs.
> > + *
>
> nit: formatting seems wrong here?

But I like it this way :( will change though.
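
For reference, the unindented continuation style presumably being asked for
here would look roughly like this (illustration only, not part of the posted
patch):

/*
 * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
 * unmapped once the map operation is completed, check limits, account
 * mapping and clean up any pre-existing VMAs.
 * @map: Mapping state.
 *
 * Returns: 0 on success, error code otherwise.
 */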

>
> > + * @map: Mapping state.
> > + *
> > + * Returns: 0 on success, error code otherwise.
> > + */
> > +static int __mmap_prepare(struct mmap_state *map)
> >  {
> > -	struct mm_struct *mm = current->mm;
> > -	struct vm_area_struct *vma = NULL;
> > -	pgoff_t pglen = PHYS_PFN(len);
> > -	unsigned long charged = 0;
> > -	struct vma_munmap_struct vms;
> > -	struct ma_state mas_detach;
> > -	struct maple_tree mt_detach;
> > -	unsigned long end = addr + len;
> >  	int error;
> > -	VMA_ITERATOR(vmi, mm, addr);
> > -	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > -
> > -	vmg.file = file;
> > -	/* Find the first overlapping VMA */
> > -	vma = vma_find(&vmi, end);
> > -	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > -	if (vma) {
> > -		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > -		mt_on_stack(mt_detach);
> > -		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	struct vma_munmap_struct *vms = &map->vms;
> > +
> > +	/* Find the first overlapping VMA and initialise unmap state. */
> > +	vms->vma = vma_find(vmi, vmg->end);
> > +	init_vma_munmap(vms, vmi, vms->vma, vmg->start, vmg->end, map->uf,
> > +			/* unlock = */ false);
> > +
> > +	/* OK, we have overlapping VMAs - prepare to unmap them. */
> > +	if (vms->vma) {
> > +		mt_init_flags(&map->mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
>
> Nit: line is too long.

Yeah, I think this is possibly pre-existing but will fix either way.

>
> > +		mt_on_stack(map->mt_detach);
> > +		mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
> >  		/* Prepare to unmap any existing mapping in the area */
> > -		error = vms_gather_munmap_vmas(&vms, &mas_detach);
> > +		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
> >  		if (error)
> > -			goto gather_failed;
> > +			return error;
>
> As Vlastimil pointed out, there is an issue just returning the error.

Yeah have addressed it, thanks!

>
> >
> > -		vmg.next = vms.next;
> > -		vmg.prev = vms.prev;
> > -		vma = NULL;
> > +		vmg->next = vms->next;
> > +		vmg->prev = vms->prev;
> >  	} else {
> > -		vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
> > +		vmg->next = vma_iter_next_rewind(vmi, &vmg->prev);
> >  	}
> >
> >  	/* Check against address space limit. */
> > -	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
> > -		error = -ENOMEM;
> > -		goto abort_munmap;
> > -	}
> > +	if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
> > +		return -ENOMEM;
> >
> > -	/*
> > -	 * Private writable mapping: check memory availability
> > -	 */
> > -	if (accountable_mapping(file, vm_flags)) {
> > -		charged = pglen;
> > -		charged -= vms.nr_accounted;
> > -		if (charged) {
> > -			error = security_vm_enough_memory_mm(mm, charged);
> > +	/* Private writable mapping: check memory availability. */
> > +	if (accountable_mapping(vmg->file, map->flags)) {
> > +		map->charged = map->pglen;
> > +		map->charged -= vms->nr_accounted;
> > +		if (map->charged) {
> > +			error = security_vm_enough_memory_mm(map->mm, map->charged);
> >  			if (error)
> > -				goto abort_munmap;
> > +				return error;
> >  		}
> >
> > -		vms.nr_accounted = 0;
> > -		vm_flags |= VM_ACCOUNT;
> > -		vmg.flags = vm_flags;
> > +		vms->nr_accounted = 0;
> > +		map->flags |= VM_ACCOUNT;
> >  	}
> >
> >  	/*
> > -	 * clear PTEs while the vma is still in the tree so that rmap
> > +	 * Clear PTEs while the vma is still in the tree so that rmap
> >  	 * cannot race with the freeing later in the truncate scenario.
> >  	 * This is also needed for mmap_file(), which is why vm_ops
> >  	 * close function is called.
> >  	 */
> > -	vms_clean_up_area(&vms, &mas_detach);
> > -	vma = vma_merge_new_range(&vmg);
> > -	if (vma)
> > -		goto expanded;
> > +	vms_clean_up_area(vms, &map->mas_detach);
> > +
> > +	return 0;
> > +}
> > +
> > +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> > +			       struct vm_area_struct **mergep)
> > +{
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	int error;
> > +
> > +	vma->vm_file = get_file(vmg->file);
> > +	error = mmap_file(vma->vm_file, vma);
> > +	if (error) {
> > +		fput(vma->vm_file);
> > +		vma->vm_file = NULL;
> > +
> > +		vma_iter_set(vmi, vma->vm_end);
> > +		/* Undo any partial mapping done by a device driver. */
> > +		unmap_region(&vmi->mas, vma, vmg->prev, vmg->next);
> > +
> > +		return error;
> > +	}
> > +
> > +	/* Drivers cannot alter the address of the VMA. */
> > +	WARN_ON_ONCE(vmg->start != vma->vm_start);
> > +	/*
> > +	 * Drivers should not permit writability when previously it was
> > +	 * disallowed.
> > +	 */
> > +	VM_WARN_ON_ONCE(map->flags != vma->vm_flags &&
> > +			!(map->flags & VM_MAYWRITE) &&
> > +			(vma->vm_flags & VM_MAYWRITE));
> > +
> > +	vma_iter_config(vmi, vmg->start, vmg->end);
> > +	/*
> > +	 * If flags changed after mmap_file(), we should try merge
> > +	 * vma again as we may succeed this time.
> > +	 */
> > +	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> > +		struct vm_area_struct *merge;
> > +
> > +		vmg->flags = vma->vm_flags;
> > +		/* If this fails, state is reset ready for a reattempt. */
> > +		merge = vma_merge_new_range(vmg);
> > +
> > +		if (merge) {
> > +			/*
> > +			 * ->mmap() can change vma->vm_file and fput
> > +			 * the original file. So fput the vma->vm_file
> > +			 * here or we would add an extra fput for file
> > +			 * and cause general protection fault
> > +			 * ultimately.
> > +			 */
> > +			fput(vma->vm_file);
> > +			vm_area_free(vma);
> > +			vma_iter_free(vmi);
> > +			*mergep = merge;
> > +		} else {
> > +			vma_iter_config(vmi, vmg->start, vmg->end);
> > +		}
> > +	}
> > +
> > +	map->flags = vma->vm_flags;
> > +	return 0;
> > +}
> > +
> > +/*
> > + * __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
> > + *                    possible.
> > + *
> > + *                    An exception to this is if the mapping is file-backed, and
> > + *                    the underlying driver changes the VMA flags, permitting a
> > + *                    subsequent merge of the VMA, in which case the returned
> > + *                    VMA is one that was merged on a second attempt.
>
> It seems all the descriptions have indented lines.

I like it that way :( will change :'(

>
> > + *
> > + * @map:  Mapping state.
> > + * @vmap: Output pointer for the new VMA.
> > + *
> > + * Returns: Zero on success, or an error.
> > + */
> > +static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > +{
> > +	struct vma_iterator *vmi = map->vmi;
> > +	struct vma_merge_struct *vmg = map->vmg;
> > +	struct vm_area_struct *merge = NULL;
> > +	int error = 0;
> > +	struct vm_area_struct *vma;
> > +
> >  	/*
> >  	 * Determine the object being mapped and call the appropriate
> >  	 * specific mapper. the address has already been validated, but
> >  	 * not unmapped, but the maps are removed from the list.
> >  	 */
> > -	vma = vm_area_alloc(mm);
> > -	if (!vma) {
> > -		error = -ENOMEM;
> > -		goto unacct_error;
> > -	}
> > +	vma = vm_area_alloc(map->mm);
> > +	if (!vma)
> > +		return -ENOMEM;
> >
> > -	vma_iter_config(&vmi, addr, end);
> > -	vma_set_range(vma, addr, end, pgoff);
> > -	vm_flags_init(vma, vm_flags);
> > -	vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > +	vma_iter_config(vmi, vmg->start, vmg->end);
>
> This function is only called from __mmap_region() after an
> __mmap_prepare() and vma_merge_new_range().  The state should be fine,
> so maybe this could be WARN_ONs?  Although, it's probably safer to just
> leave it.

Yeah perhaps one for the laterbase?
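
For what it's worth, such a later cleanup might look roughly like this;
purely illustrative, and assuming vma_iter_addr()/vma_iter_end() are usable
at this point:

	/*
	 * The iterator should already be positioned correctly by
	 * __mmap_prepare() and vma_merge_new_range(), so assert rather
	 * than reconfigure.
	 */
	VM_WARN_ON_ONCE(vma_iter_addr(vmi) != vmg->start);
	VM_WARN_ON_ONCE(vma_iter_end(vmi) != vmg->end);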

>
> > +	vma_set_range(vma, vmg->start, vmg->end, vmg->pgoff);
> > +	vm_flags_init(vma, map->flags);
> > +	vma->vm_page_prot = vm_get_page_prot(map->flags);
> >
> > -	if (vma_iter_prealloc(&vmi, vma)) {
> > +	if (vma_iter_prealloc(vmi, vma)) {
> >  		error = -ENOMEM;
> >  		goto free_vma;
> >  	}
> >
> > -	if (file) {
> > -		vma->vm_file = get_file(file);
> > -		error = mmap_file(file, vma);
> > -		if (error)
> > -			goto unmap_and_free_file_vma;
> > -
> > -		/* Drivers cannot alter the address of the VMA. */
> > -		WARN_ON_ONCE(addr != vma->vm_start);
> > -		/*
> > -		 * Drivers should not permit writability when previously it was
> > -		 * disallowed.
> > -		 */
> > -		VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> > -				!(vm_flags & VM_MAYWRITE) &&
> > -				(vma->vm_flags & VM_MAYWRITE));
> > -
> > -		vma_iter_config(&vmi, addr, end);
> > -		/*
> > -		 * If vm_flags changed after mmap_file(), we should try merge
> > -		 * vma again as we may succeed this time.
> > -		 */
> > -		if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> > -			struct vm_area_struct *merge;
> > -
> > -			vmg.flags = vma->vm_flags;
> > -			/* If this fails, state is reset ready for a reattempt. */
> > -			merge = vma_merge_new_range(&vmg);
> > -
> > -			if (merge) {
> > -				/*
> > -				 * ->mmap() can change vma->vm_file and fput
> > -				 * the original file. So fput the vma->vm_file
> > -				 * here or we would add an extra fput for file
> > -				 * and cause general protection fault
> > -				 * ultimately.
> > -				 */
> > -				fput(vma->vm_file);
> > -				vm_area_free(vma);
> > -				vma_iter_free(&vmi);
> > -				vma = merge;
> > -				/* Update vm_flags to pick up the change. */
> > -				vm_flags = vma->vm_flags;
> > -				goto file_expanded;
> > -			}
> > -			vma_iter_config(&vmi, addr, end);
> > -		}
> > -
> > -		vm_flags = vma->vm_flags;
> > -	} else if (vm_flags & VM_SHARED) {
> > +	if (vmg->file)
> > +		error = __mmap_new_file_vma(map, vma, &merge);
> > +	else if (map->flags & VM_SHARED)
> >  		error = shmem_zero_setup(vma);
> > -		if (error)
> > -			goto free_iter_vma;
> > -	} else {
> > +	else
> >  		vma_set_anonymous(vma);
> > -	}
> > +
> > +	if (error)
> > +		goto free_iter_vma;
> > +
> > +	if (merge)
> > +		goto file_expanded;
> >
> >  #ifdef CONFIG_SPARC64
> >  	/* TODO: Fix SPARC ADI! */
> > -	WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> > +	WARN_ON_ONCE(!arch_validate_flags(map->flags));
> >  #endif
> >
> >  	/* Lock the VMA since it is modified after insertion into VMA tree */
> >  	vma_start_write(vma);
> > -	vma_iter_store(&vmi, vma);
> > -	mm->map_count++;
> > +	vma_iter_store(vmi, vma);
> > +	map->mm->map_count++;
> >  	vma_link_file(vma);
> >
> >  	/*
> >  	 * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
> >  	 * call covers the non-merge case.
> >  	 */
> > -	khugepaged_enter_vma(vma, vma->vm_flags);
> > +	khugepaged_enter_vma(vma, map->flags);
> >
> >  file_expanded:
> > -	file = vma->vm_file;
> >  	ksm_add_vma(vma);
>
> __mmap_new_file_vma() may free the vma.  I assume this is what you
> mentioned elsewhere about UAF.

Yeah, no, this is a new one. Let me totally rework this thing to make it
operate on a single VMA and return a merged boolean, and reduce the window
between 'pointer freed' and 'pointer reassigned' to literally ZERO LINES.

Sorry, I was probably feverish when I did this bit...
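
A rough sketch of that rework might look something like the below; this is
illustrative only (the signature and the bool *mergedp parameter are assumed,
not the actual v2 patch), but it shows how the freed VMA pointer would be
replaced immediately rather than handed back via a separate merge pointer:

static int __mmap_new_file_vma(struct mmap_state *map,
			       struct vm_area_struct **vmap, bool *mergedp)
{
	struct vma_iterator *vmi = map->vmi;
	struct vma_merge_struct *vmg = map->vmg;
	struct vm_area_struct *vma = *vmap;
	struct vm_area_struct *merge;

	/* ... get_file()/mmap_file() and the sanity checks as posted ... */

	vma_iter_config(vmi, vmg->start, vmg->end);
	/* Retry the merge only if the driver changed the flags. */
	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
		vmg->flags = vma->vm_flags;
		merge = vma_merge_new_range(vmg);
		if (merge) {
			/* ->mmap() may have changed vma->vm_file, so fput it. */
			fput(vma->vm_file);
			vm_area_free(vma);
			vma_iter_free(vmi);
			/* The freed pointer is replaced immediately. */
			*vmap = merge;
			*mergedp = true;
		} else {
			vma_iter_config(vmi, vmg->start, vmg->end);
		}
	}

	map->flags = (*vmap)->vm_flags;
	return 0;
}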

>
> > -expanded:
> > +
>
> Extra whitespace.

You hate that whitespace :(

>
> > +	*vmap = vma;
> > +	return 0;
> > +
> > +free_iter_vma:
> > +	vma_iter_free(vmi);
> > +free_vma:
> > +	vm_area_free(vma);
> > +	return error;
> > +}
> > +
> > +/*
> > + * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
> > + *                     statistics, handle locking and finalise the VMA.
> > + *
> > + * @map: Mapping state.
> > + * @vma: Merged or newly allocated VMA for the mmap()'d region.
> > + */
> > +static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> > +{
> > +	struct mm_struct *mm = map->mm;
> > +	unsigned long vm_flags = vma->vm_flags;
> > +
> >  	perf_event_mmap(vma);
> >
> > -	/* Unmap any existing mapping in the area */
> > -	vms_complete_munmap_vmas(&vms, &mas_detach);
> > +	/* Unmap any existing mapping in the area. */
> > +	vms_complete_munmap_vmas(&map->vms, &map->mas_detach);
> >
> > -	vm_stat_account(mm, vm_flags, pglen);
> > +	vm_stat_account(mm, vma->vm_flags, map->pglen);
> >  	if (vm_flags & VM_LOCKED) {
> >  		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> >  					is_vm_hugetlb_page(vma) ||
> > -					vma == get_gate_vma(current->mm))
> > +					vma == get_gate_vma(mm))
> >  			vm_flags_clear(vma, VM_LOCKED_MASK);
> >  		else
> > -			mm->locked_vm += pglen;
> > +			mm->locked_vm += map->pglen;
> >  	}
> >
> > -	if (file)
> > +	if (vma->vm_file)
> >  		uprobe_mmap(vma);
> >
> >  	/*
> > @@ -2364,26 +2447,43 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
> >  	vm_flags_set(vma, VM_SOFTDIRTY);
> >
> >  	vma_set_page_prot(vma);
> > +}
> >
> > -	return addr;
> > +unsigned long __mmap_region(struct file *file, unsigned long addr,
> > +		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > +		struct list_head *uf)
> > +{
> > +	struct mm_struct *mm = current->mm;
> > +	struct vm_area_struct *vma;
> > +	int error;
> > +	VMA_ITERATOR(vmi, mm, addr);
> > +	VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, pgoff);
> > +	MMAP_STATE(map, mm, &vmi, &vmg, uf, vm_flags, len);
> >
> > -unmap_and_free_file_vma:
> > -	fput(vma->vm_file);
> > -	vma->vm_file = NULL;
> > +	vmg.file = file;
> >
> > -	vma_iter_set(&vmi, vma->vm_end);
> > -	/* Undo any partial mapping done by a device driver. */
> > -	unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
> > -free_iter_vma:
> > -	vma_iter_free(&vmi);
> > -free_vma:
> > -	vm_area_free(vma);
> > -unacct_error:
> > -	if (charged)
> > -		vm_unacct_memory(charged);
> > +	error = __mmap_prepare(&map);
> > +	if (error)
> > +		goto abort_munmap;
> > +
> > +	/* Attempt to merge with adjacent VMAs... */
> > +	vmg.flags = map.flags;
> > +	vma = vma_merge_new_range(&vmg);
> > +	if (!vma) {
> > +		/* ...but if we can't, allocate a new VMA. */
> > +		error = __mmap_new_vma(&map, &vma);
> > +		if (error)
> > +			goto unacct_error;
> > +	}
> > +
> > +	__mmap_complete(&map, vma);
> >
> > +	return addr;
> > +
> > +unacct_error:
> > +	if (map.charged)
> > +		vm_unacct_memory(map.charged);
>
> So this is effectively undoing __mmap_prepare()'s accounting. I don't
> have a better label for it, but it's not obvious by the label that the
> accounting was done in __mmap_prepare().

There's a comment in the description of the function. I'll add a comment
here too to be clear about it.
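
Presumably something along these lines (sketch only):

unacct_error:
	/*
	 * Accounting was charged in __mmap_prepare(); undo it here if
	 * anything was charged before we failed.
	 */
	if (map.charged)
		vm_unacct_memory(map.charged);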

>
> >  abort_munmap:
> > -	vms_abort_munmap_vmas(&vms, &mas_detach);
> > -gather_failed:
> > +	vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
> >  	return error;
> >  }
> > --
> > 2.47.0


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region()
  2024-10-23 17:39       ` Lorenzo Stoakes
@ 2024-10-23 18:12         ` Liam R. Howlett
  0 siblings, 0 replies; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 18:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Andrew Morton, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241023 13:39]:
> On Wed, Oct 23, 2024 at 11:21:54AM -0400, Liam R. Howlett wrote:
> > * Vlastimil Babka <vbabka@suse.cz> [241023 10:39]:
> > > On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > > > We have seen bugs and resource leaks arise from the complexity of the
> > > > __mmap_region() function. This, and the generally deeply fragile error
> > > > handling logic and complexity which makes understanding the function
> > > > difficult make it highly desirable to refactor it into something readable.
> > > >
> > > > Achieve this by separating the function into smaller logical parts which
> > > > are easier to understand and follow, and which importantly very
> > > > significantly simplify the error handling.
> > > >
> > > > Note that we now call vms_abort_munmap_vmas() in more error paths than we
> > > > used to, however in cases where no abort need occur, vms->nr_pages will be
> > > > equal to zero and we simply exit this function without doing more than we
> > > > would have done previously.
> > > >
> > > > Importantly, the invocation of the driver mmap hook via mmap_file() now has
> > > > very simple and obvious handling (this was previously the most problematic
> > > > part of the mmap() operation).
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > ---
> > > >  mm/vma.c | 380 +++++++++++++++++++++++++++++++++++--------------------
> > > >  1 file changed, 240 insertions(+), 140 deletions(-)
> > > >
> > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > index 7617f9d50d62..a271e2b406ab 100644
> > > > --- a/mm/vma.c
> > > > +++ b/mm/vma.c
> > > > @@ -7,6 +7,31 @@
> > > >  #include "vma_internal.h"
> > > >  #include "vma.h"
> > > >
> > > > +struct mmap_state {
> > > > +	struct mm_struct *mm;
> > > > +	struct vma_iterator *vmi;
> > > > +	struct vma_merge_struct *vmg;
> > > > +	struct list_head *uf;
> > > > +
> > > > +	struct vma_munmap_struct vms;
> > > > +	struct ma_state mas_detach;
> > > > +	struct maple_tree mt_detach;
> > > > +
> > > > +	unsigned long flags;
> > > > +	unsigned long pglen;
> > > > +	unsigned long charged;
> > > > +};
> > > > +
> > > > +#define MMAP_STATE(name, mm_, vmi_, vmg_, uf_, flags_, len_)	\
> > > > +	struct mmap_state name = {				\
> > > > +		.mm = mm_,					\
> > > > +		.vmi = vmi_,					\
> > > > +		.vmg = vmg_,					\
> > > > +		.uf = uf_,					\
> > > > +		.flags = flags_,				\
> > > > +		.pglen = PHYS_PFN(len_),			\
> > > > +	}
> > > > +
> > > >  static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
> > > >  {
> > > >  	struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
> > > > @@ -2169,189 +2194,247 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> > > >  	vms_complete_munmap_vmas(vms, mas_detach);
> > > >  }
> > > >
> > > > -unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > > -		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> > > > -		struct list_head *uf)
> > > > +/*
> > > > + * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
> > > > + *                    unmapped once the map operation is completed, check limits,
> > > > + *                    account mapping and clean up any pre-existing VMAs.
> > > > + *
> > > > + * @map: Mapping state.
> > > > + *
> > > > + * Returns: 0 on success, error code otherwise.
> > > > + */
> > > > +static int __mmap_prepare(struct mmap_state *map)
> > > >  {
> > > > -	struct mm_struct *mm = current->mm;
> > > > -	struct vm_area_struct *vma = NULL;
> > > > -	pgoff_t pglen = PHYS_PFN(len);
> > > > -	unsigned long charged = 0;
> > > > -	struct vma_munmap_struct vms;
> > > > -	struct ma_state mas_detach;
> > > > -	struct maple_tree mt_detach;
> > > > -	unsigned long end = addr + len;
> > > >  	int error;
> > > > -	VMA_ITERATOR(vmi, mm, addr);
> > > > -	VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > > > -
> > > > -	vmg.file = file;
> > > > -	/* Find the first overlapping VMA */
> > > > -	vma = vma_find(&vmi, end);
> > > > -	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > > > -	if (vma) {
> > > > -		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > > > -		mt_on_stack(mt_detach);
> > > > -		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > > > +	struct vma_iterator *vmi = map->vmi;
> > > > +	struct vma_merge_struct *vmg = map->vmg;
> > > > +	struct vma_munmap_struct *vms = &map->vms;
> > > > +
> > > > +	/* Find the first overlapping VMA and initialise unmap state. */
> > > > +	vms->vma = vma_find(vmi, vmg->end);
> > > > +	init_vma_munmap(vms, vmi, vms->vma, vmg->start, vmg->end, map->uf,
> > > > +			/* unlock = */ false);
> > > > +
> > > > +	/* OK, we have overlapping VMAs - prepare to unmap them. */
> > > > +	if (vms->vma) {
> > > > +		mt_init_flags(&map->mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > > > +		mt_on_stack(map->mt_detach);
> > > > +		mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
> > > >  		/* Prepare to unmap any existing mapping in the area */
> > > > -		error = vms_gather_munmap_vmas(&vms, &mas_detach);
> > > > +		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
> > > >  		if (error)
> > > > -			goto gather_failed;
> > > > +			return error;
> > >
> > > So this assumes vms_abort_munmap_vmas() will rely on the "vms->nr_pages will
> > > be equal to zero" mentioned in commit log. But AFAICS
> > > vms_gather_munmap_vmas() can fail in Nth iteration of its
> > > for_each_vma_range() after some iterations already increased nr_pages and it
> > > will do a reattach_vmas() and return the error and we just return the error
> > > here.
> > > I think either here or maybe in vms_gather_munmap_vmas() itself a reset of
> > > vms->nr_pages to zero on error should happen for the vms_abort_munmap_vmas()
> > > to be a no-op?
> >
> > Probably in reattach_vmas()?
> 
> Hm, but that only accepts a mas and seems redundant elsewhere... I'm going for
> simply resetting nr_pages for now and maybe we can revisit if need be?

Okay.
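
For reference, the "just reset nr_pages" approach discussed above might look
roughly like this in __mmap_prepare() (illustration only, not the actual
follow-up patch):

		error = vms_gather_munmap_vmas(vms, &map->mas_detach);
		if (error) {
			/*
			 * The gather path has already reattached the VMAs on
			 * failure; zero nr_pages so the caller's later
			 * vms_abort_munmap_vmas() becomes a no-op.
			 */
			vms->nr_pages = 0;
			return error;
		}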


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs
  2024-10-23 15:16     ` Lorenzo Stoakes
@ 2024-10-23 18:16       ` Liam R. Howlett
  2024-10-23 18:20         ` Lorenzo Stoakes
  0 siblings, 1 reply; 46+ messages in thread
From: Liam R. Howlett @ 2024-10-23 18:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Andrew Morton, Jann Horn, linux-kernel,
	linux-mm, Linus Torvalds, Peter Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241023 11:16]:
> On Wed, Oct 23, 2024 at 05:01:29PM +0200, Vlastimil Babka wrote:
> > On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > > Previously, we'd always try to merge a file-backed VMA if its flags were
> > > changed by the driver.
> > >
> > > This however is rarely meaningful as typically the flags would be changed
> > > to VM_PFNMAP or other VM_SPECIAL flags which are inherently unmergable.
> > >
> > > In cases where it is meaningful (for instance DAX) it is doubtful that this
> >
> > Hm if that's true, I'm imagining many piecemeal mmap()s of DAX that used to
> > merge but now will create tons of VMA's, which doesn't sound great. Then it
> > has also potentially breaking effects on mremap() which doesn't work across
> > multiple VMA's.
> 
> I said this repeatedly to you and Liam but you both still seemed to want
> this :))
> 
> Anyway, yes. I mean you're pretty crazy if you are mapping a bunch of
> adjacent DAX ranges that are all otherwise mergeable next to one another,
> then on that basis assuming that you can mremap() the whole thing.

Today, this will only work if there is a previous vma, otherwise we skip
the attempt to merge.  It's probably a safe bet that there is a previous
vma, but it still means relying on this merging is broken today.

> 
> >
> > > optimisation is worth the effort and maintenance risk of having to unwind
> > > state and perform a merge.
> >
> > What if we simply created a new vma but only then checked if the flags
> > changed and we can merge it with neighbours (i.e. like the mprotect()
> > merging case). Less efficient, but less tricky and with the same result
> > hopefully?
> 
> I'd probably rather just drop this idea rather than wade into something
> entirely new, but let me look at whether we can leverage
> vma_modify_flags().
> 
> I have a feeling we can't because we already explicitly reset state in the
> merge new VMA case, and we'd be introducing a new way in which state could
> get mangled.
> 
> But I'll take a look and see, otherwise we can just drop this for now and
> potentially come back to it later, the key bit of the non-backport patches
> are 5-7 anyway.
> 
> >
> > > Since we've observed bugs and resource leaks due to complexity in this
> > > area, it is simply not acceptable to have a 'nice to have' optimisation
> > > like this complicating an already very complicated code path, so let's
> > > simply eliminate it.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > ---
> > >  mm/vma.c | 39 ++-------------------------------------
> > >  1 file changed, 2 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index a271e2b406ab..fe1fe5099e78 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -2260,8 +2260,7 @@ static int __mmap_prepare(struct mmap_state *map)
> > >  	return 0;
> > >  }
> > >
> > > -static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> > > -			       struct vm_area_struct **mergep)
> > > +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma)
> > >  {
> > >  	struct vma_iterator *vmi = map->vmi;
> > >  	struct vma_merge_struct *vmg = map->vmg;
> > > @@ -2291,34 +2290,6 @@ static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vm
> > >  			(vma->vm_flags & VM_MAYWRITE));
> > >
> > >  	vma_iter_config(vmi, vmg->start, vmg->end);
> > > -	/*
> > > -	 * If flags changed after mmap_file(), we should try merge
> > > -	 * vma again as we may succeed this time.
> > > -	 */
> > > -	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> > > -		struct vm_area_struct *merge;
> > > -
> > > -		vmg->flags = vma->vm_flags;
> > > -		/* If this fails, state is reset ready for a reattempt. */
> > > -		merge = vma_merge_new_range(vmg);
> > > -
> > > -		if (merge) {
> > > -			/*
> > > -			 * ->mmap() can change vma->vm_file and fput
> > > -			 * the original file. So fput the vma->vm_file
> > > -			 * here or we would add an extra fput for file
> > > -			 * and cause general protection fault
> > > -			 * ultimately.
> > > -			 */
> > > -			fput(vma->vm_file);
> > > -			vm_area_free(vma);
> > > -			vma_iter_free(vmi);
> > > -			*mergep = merge;
> > > -		} else {
> > > -			vma_iter_config(vmi, vmg->start, vmg->end);
> > > -		}
> > > -	}
> > > -
> > >  	map->flags = vma->vm_flags;
> > >  	return 0;
> > >  }
> > > @@ -2341,7 +2312,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > >  {
> > >  	struct vma_iterator *vmi = map->vmi;
> > >  	struct vma_merge_struct *vmg = map->vmg;
> > > -	struct vm_area_struct *merge = NULL;
> > >  	int error = 0;
> > >  	struct vm_area_struct *vma;
> > >
> > > @@ -2365,7 +2335,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > >  	}
> > >
> > >  	if (vmg->file)
> > > -		error = __mmap_new_file_vma(map, vma, &merge);
> > > +		error = __mmap_new_file_vma(map, vma);
> > >  	else if (map->flags & VM_SHARED)
> > >  		error = shmem_zero_setup(vma);
> > >  	else
> > > @@ -2374,9 +2344,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > >  	if (error)
> > >  		goto free_iter_vma;
> > >
> > > -	if (merge)
> > > -		goto file_expanded;
> > > -
> > >  #ifdef CONFIG_SPARC64
> > >  	/* TODO: Fix SPARC ADI! */
> > >  	WARN_ON_ONCE(!arch_validate_flags(map->flags));
> > > @@ -2393,8 +2360,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > >  	 * call covers the non-merge case.
> > >  	 */
> > >  	khugepaged_enter_vma(vma, map->flags);
> > > -
> > > -file_expanded:
> > >  	ksm_add_vma(vma);
> > >
> > >  	*vmap = vma;
> > > --
> > > 2.47.0
> >


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs
  2024-10-23 18:16       ` Liam R. Howlett
@ 2024-10-23 18:20         ` Lorenzo Stoakes
  0 siblings, 0 replies; 46+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 18:20 UTC (permalink / raw)
  To: Liam R. Howlett, Vlastimil Babka, Andrew Morton, Jann Horn,
	linux-kernel, linux-mm, Linus Torvalds, Peter Xu

On Wed, Oct 23, 2024 at 02:16:19PM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241023 11:16]:
> > On Wed, Oct 23, 2024 at 05:01:29PM +0200, Vlastimil Babka wrote:
> > > On 10/22/24 22:40, Lorenzo Stoakes wrote:
> > > > Previously, we'd always try to merge a file-backed VMA if its flags were
> > > > changed by the driver.
> > > >
> > > > This however is rarely meaningful as typically the flags would be changed
> > > > to VM_PFNMAP or other VM_SPECIAL flags which are inherently unmergable.
> > > >
> > > > In cases where it is meaningful (for instance DAX) it is doubtful that this
> > >
> > > Hm if that's true, I'm imagining many piecemeal mmap()s of DAX that used to
> > > merge but now will create tons of VMA's, which doesn't sound great. Then it
> > > has also potentially breaking effects on mremap() which doesn't work across
> > > multiple VMA's.
> >
> > I said this repeatedly to you and Liam but you both still seemed to want
> > this :))
> >
> > Anyway, yes. I mean you're pretty crazy if you are mapping a bunch of
> > adjacent DAX ranges that are all otherwise mergeable next to one another,
> > then on that basis assuming that you can mremap() the whole thing.
>
> Today, this will only work if there is a previous vma, otherwise we skip
> the attempt to merge.  It's probably a safe bet that there is a previous
> vma, but it still means relying on this merging is broken today.

True, but I also have a cunning plan, based on what Vlasta was getting at
with another (possibly not quite as cunning) idea, which means we might
still be able to do this, but in a much less terrible way...

>
> >
> > >
> > > > optimisation is worth the effort and maintenance risk of having to unwind
> > > > state and perform a merge.
> > >
> > > What if we simply created a new vma but only then checked if the flags
> > > changed and we can merge it with neighbours (i.e. like the mprotect()
> > > merging case). Less efficient, but less tricky and with the same result
> > > hopefully?
> >
> > I'd probably rather just drop this idea rather than wade into something
> > entirely new, but let me look at whether we can leverage
> > vma_modify_flags().
> >
> > I have a feeling we can't because we already explicitly reset state in the
> > merge new VMA case, and we'd be introducing a new way in which state could
> > get mangled.
> >
> > But I'll take a look and see, otherwise we can just drop this for now and
> > potentially come back to it later, the key bit of the non-backport patches
> > are 5-7 anyway.
> >
> > >
> > > > Since we've observed bugs and resource leaks due to complexity in this
> > > > area, it is simply not acceptable to have a 'nice to have' optimisation
> > > > like this complicating an already very complicated code path, so let's
> > > > simply eliminate it.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > ---
> > > >  mm/vma.c | 39 ++-------------------------------------
> > > >  1 file changed, 2 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > index a271e2b406ab..fe1fe5099e78 100644
> > > > --- a/mm/vma.c
> > > > +++ b/mm/vma.c
> > > > @@ -2260,8 +2260,7 @@ static int __mmap_prepare(struct mmap_state *map)
> > > >  	return 0;
> > > >  }
> > > >
> > > > -static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma,
> > > > -			       struct vm_area_struct **mergep)
> > > > +static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vma)
> > > >  {
> > > >  	struct vma_iterator *vmi = map->vmi;
> > > >  	struct vma_merge_struct *vmg = map->vmg;
> > > > @@ -2291,34 +2290,6 @@ static int __mmap_new_file_vma(struct mmap_state *map, struct vm_area_struct *vm
> > > >  			(vma->vm_flags & VM_MAYWRITE));
> > > >
> > > >  	vma_iter_config(vmi, vmg->start, vmg->end);
> > > > -	/*
> > > > -	 * If flags changed after mmap_file(), we should try merge
> > > > -	 * vma again as we may succeed this time.
> > > > -	 */
> > > > -	if (unlikely(map->flags != vma->vm_flags && vmg->prev)) {
> > > > -		struct vm_area_struct *merge;
> > > > -
> > > > -		vmg->flags = vma->vm_flags;
> > > > -		/* If this fails, state is reset ready for a reattempt. */
> > > > -		merge = vma_merge_new_range(vmg);
> > > > -
> > > > -		if (merge) {
> > > > -			/*
> > > > -			 * ->mmap() can change vma->vm_file and fput
> > > > -			 * the original file. So fput the vma->vm_file
> > > > -			 * here or we would add an extra fput for file
> > > > -			 * and cause general protection fault
> > > > -			 * ultimately.
> > > > -			 */
> > > > -			fput(vma->vm_file);
> > > > -			vm_area_free(vma);
> > > > -			vma_iter_free(vmi);
> > > > -			*mergep = merge;
> > > > -		} else {
> > > > -			vma_iter_config(vmi, vmg->start, vmg->end);
> > > > -		}
> > > > -	}
> > > > -
> > > >  	map->flags = vma->vm_flags;
> > > >  	return 0;
> > > >  }
> > > > @@ -2341,7 +2312,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > > >  {
> > > >  	struct vma_iterator *vmi = map->vmi;
> > > >  	struct vma_merge_struct *vmg = map->vmg;
> > > > -	struct vm_area_struct *merge = NULL;
> > > >  	int error = 0;
> > > >  	struct vm_area_struct *vma;
> > > >
> > > > @@ -2365,7 +2335,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > > >  	}
> > > >
> > > >  	if (vmg->file)
> > > > -		error = __mmap_new_file_vma(map, vma, &merge);
> > > > +		error = __mmap_new_file_vma(map, vma);
> > > >  	else if (map->flags & VM_SHARED)
> > > >  		error = shmem_zero_setup(vma);
> > > >  	else
> > > > @@ -2374,9 +2344,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > > >  	if (error)
> > > >  		goto free_iter_vma;
> > > >
> > > > -	if (merge)
> > > > -		goto file_expanded;
> > > > -
> > > >  #ifdef CONFIG_SPARC64
> > > >  	/* TODO: Fix SPARC ADI! */
> > > >  	WARN_ON_ONCE(!arch_validate_flags(map->flags));
> > > > @@ -2393,8 +2360,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
> > > >  	 * call covers the non-merge case.
> > > >  	 */
> > > >  	khugepaged_enter_vma(vma, map->flags);
> > > > -
> > > > -file_expanded:
> > > >  	ksm_add_vma(vma);
> > > >
> > > >  	*vmap = vma;
> > > > --
> > > > 2.47.0
> > >


^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2024-10-23 18:21 UTC | newest]

Thread overview: 46+ messages
2024-10-22 20:40 [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
2024-10-22 20:40 ` [PATCH hotfix 6.12 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
2024-10-22 21:14   ` Jann Horn
2024-10-23 16:56     ` Lorenzo Stoakes
2024-10-23  9:11   ` Vlastimil Babka
2024-10-23 14:22   ` Liam R. Howlett
2024-10-22 20:40 ` [PATCH hotfix 6.12 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
2024-10-22 21:15   ` Jann Horn
2024-10-23 17:00     ` Lorenzo Stoakes
2024-10-23  9:24   ` Vlastimil Babka
2024-10-23 16:58     ` Lorenzo Stoakes
2024-10-23 14:26   ` Liam R. Howlett
2024-10-23 14:41   ` Liam R. Howlett
2024-10-22 20:40 ` [PATCH hotfix 6.12 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
2024-10-22 21:15   ` Jann Horn
2024-10-23 16:35     ` Lorenzo Stoakes
2024-10-23  9:57   ` Vlastimil Babka
2024-10-23 14:30   ` Liam R. Howlett
2024-10-23 16:25     ` Linus Torvalds
2024-10-23 16:37     ` Lorenzo Stoakes
2024-10-22 20:40 ` [PATCH hotfix 6.12 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
2024-10-23 12:58   ` Vlastimil Babka
2024-10-23 14:20     ` Liam R. Howlett
2024-10-23 17:11       ` Lorenzo Stoakes
2024-10-23 15:02   ` Liam R. Howlett
2024-10-22 20:40 ` [PATCH hotfix 6.12 5/8] tools: testing: add additional vma_internal.h stubs Lorenzo Stoakes
2024-10-22 20:40 ` [PATCH hotfix 6.12 6/8] mm: insolate mmap internal logic to mm/vma.c Lorenzo Stoakes
2024-10-22 20:40 ` [PATCH hotfix 6.12 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
2024-10-23 14:38   ` Vlastimil Babka
2024-10-23 15:21     ` Liam R. Howlett
2024-10-23 17:39       ` Lorenzo Stoakes
2024-10-23 18:12         ` Liam R. Howlett
2024-10-23 17:30     ` Lorenzo Stoakes
2024-10-23 17:19   ` Liam R. Howlett
2024-10-23 17:52     ` Lorenzo Stoakes
2024-10-22 20:40 ` [PATCH hotfix 6.12 8/8] mm: do not attempt second merge for file-backed VMAs Lorenzo Stoakes
2024-10-23 15:01   ` Vlastimil Babka
2024-10-23 15:16     ` Lorenzo Stoakes
2024-10-23 18:16       ` Liam R. Howlett
2024-10-23 18:20         ` Lorenzo Stoakes
2024-10-22 20:58 ` [PATCH hotfix 6.12 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
2024-10-23  8:37 ` Vlastimil Babka
2024-10-23  8:45   ` Lorenzo Stoakes
2024-10-23 10:22 ` Andrew Morton
2024-10-23 10:25   ` Andrew Morton
2024-10-23 10:26   ` Lorenzo Stoakes
