* [PATCH v5 00/21] Avoid MAP_FIXED gap exposure
@ 2024-07-17 20:06 Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 01/21] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
                   ` (20 more replies)
  0 siblings, 21 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett

It is now possible to walk the vma tree under the rcu read lock, and
doing so is beneficial because it reduces lock contention.  Walking the
tree while a MAP_FIXED mapping is executing, however, means that a
reader may see a gap in the vma tree that should never logically exist -
and does not exist when using the mmap lock in read mode.  The temporal
gap exists because mmap_region() calls munmap() prior to installing the
new mapping.

This patch set stops rcu readers from seeing the temporal gap by
splitting up the munmap() function into two parts.  The first part
prepares the vma tree for modifications by doing the necessary splits
and tracks the vmas marked for removal in a side tree.  The second part
completes the munmapping of the vmas after the vma tree has been
overwritten (either by a MAP_FIXED replacement vma or by a NULL in the
munmap() case).

Please note that rcu walkers will still be able to see a temporary state
of split vmas that may be in the process of being removed, but the
temporal gap will not be exposed.  vma_start_write() is called on both
parts of the split vma, so this state is detectable.
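
For reference, the munmap() side of the resulting flow looks roughly
like the sketch below.  It is a simplified outline using the helper
names introduced later in the series (essentially the shape
do_vmi_align_munmap() takes once the helpers are in place); error
handling, locking and accounting details are omitted.  mmap_region()
follows the same two phases, but overwrites the range with the
MAP_FIXED replacement vma instead of clearing it:

	init_vma_munmap(&vms, &vmi, vma, start, end, uf, unlock);

	/* Phase 1: split the boundary vmas and gather the affected vmas
	 * into a detached side tree.  The main vma tree - and therefore
	 * what rcu readers see - is not modified yet.
	 */
	error = vms_gather_munmap_vmas(&vms, &mas_detach);
	if (error)
		return error;

	/* Replace the range in the main tree in one update: with NULL
	 * here, or with the new vma in the MAP_FIXED case.  On failure
	 * abort_munmap_vmas() reattaches the gathered vmas.
	 */
	error = vma_iter_clear_gfp(&vmi, start, end, GFP_KERNEL);
	if (error) {
		abort_munmap_vmas(&mas_detach);
		return error;
	}

	/* Phase 2: flush the TLB, free the page tables, update the mm
	 * counters and free the detached vmas.
	 */
	vms_complete_munmap_vmas(&vms, &mas_detach);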

RFC: https://lore.kernel.org/linux-mm/20240531163217.1584450-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20240611180200.711239-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/all/20240625191145.3382793-1-Liam.Howlett@oracle.com/
v3: https://lore.kernel.org/linux-mm/20240704182718.2653918-1-Liam.Howlett@oracle.com/
v4: https://lore.kernel.org/linux-mm/20240710192250.4114783-1-Liam.Howlett@oracle.com/

Changes since v4:
 - rebase on akpm/mm-unstable
 - init_vma_munmap() has an else statement to set start/end to 0 and mm
   to NULL.
 - Don't drop arch_unmap() as powerpc needs it.  Relocate it instead.
 - Call vma->vm_ops->close() before completing the removal of vmas, and
   call vma->vm_ops->open() on abort.  This fixes the ltp hugemmap06
   test.

Liam R. Howlett (21):
  mm/mmap: Correctly position vma_iterator in __split_vma()
  mm/mmap: Introduce abort_munmap_vmas()
  mm/mmap: Introduce vmi_complete_munmap_vmas()
  mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
  mm/mmap: Introduce vma_munmap_struct for use in munmap operations
  mm/mmap: Change munmap to use vma_munmap_struct() for accounting and
    surrounding vmas
  mm/mmap: Extract validate_mm() from vma_complete()
  mm/mmap: Inline munmap operation in mmap_region()
  mm/mmap: Expand mmap_region() munmap call
  mm/mmap: Support vma == NULL in init_vma_munmap()
  mm/mmap: Reposition vma iterator in mmap_region()
  mm/mmap: Track start and end of munmap in vma_munmap_struct
  mm/mmap: Clean up unmap_region() argument list
  mm/mmap: Avoid zeroing vma tree in mmap_region()
  mm/mmap: Use PHYS_PFN in mmap_region()
  mm/mmap: Use vms accounted pages in mmap_region()
  mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas()
  mm/mmap: Move can_modify_mm() check down the stack
  ipc/shm, mm: Drop do_vma_munmap()
  mm/mmap: Move may_expand_vm() check in mmap_region()
  mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas()

 include/linux/mm.h |   6 +-
 ipc/shm.c          |   8 +-
 mm/internal.h      |  26 ++
 mm/mmap.c          | 580 ++++++++++++++++++++++++++-------------------
 4 files changed, 375 insertions(+), 245 deletions(-)

-- 
2.43.0




* [PATCH v5 01/21] mm/mmap: Correctly position vma_iterator in __split_vma()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 02/21] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett

The vma iterator may be left pointing to the newly created vma.  This
happens when inserting the new vma at the end of the old vma
(!new_below).

The incorrect position in the vma iterator is not exposed currently
since the vma iterator is repositioned in the munmap path and is not
reused in any of the other paths.

This has limited impact in the current code, but is required for future
changes.

Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/mmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index e42d89f98071..28a46d9ddde0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2414,7 +2414,7 @@ static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
 /*
  * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
  * has already been checked or doesn't make sense to fail.
- * VMA Iterator will point to the end VMA.
+ * VMA Iterator will point to the original vma.
  */
 static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		       unsigned long addr, int new_below)
@@ -2483,6 +2483,9 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	/* Success. */
 	if (new_below)
 		vma_next(vmi);
+	else
+		vma_prev(vmi);
+
 	return 0;
 
 out_free_mpol:
-- 
2.43.0




* [PATCH v5 02/21] mm/mmap: Introduce abort_munmap_vmas()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 01/21] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 03/21] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

Extract the clean-up of failed munmap() operations from
do_vmi_align_munmap().  This simplifies later patches in the series.

It is worth noting that the mas_for_each() loop now has a different
upper limit.  This should not change the number of vmas visited for
reattaching to the main vma tree (mm_mt), as all vmas are reattached in
both scenarios.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 28a46d9ddde0..babfa50f1411 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2586,6 +2586,22 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
 			 vma->vm_userfaultfd_ctx, anon_vma_name(vma));
 }
 
+/*
+ * abort_munmap_vmas - Undo any munmap work and free resources
+ *
+ * Reattach any detached vmas and free up the maple tree used to track the vmas.
+ */
+static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+{
+	struct vm_area_struct *vma;
+
+	mas_set(mas_detach, 0);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		vma_mark_detached(vma, false);
+
+	__mt_destroy(mas_detach->tree);
+}
+
 /*
  * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
  * @vmi: The vma iterator
@@ -2740,11 +2756,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 userfaultfd_error:
 munmap_gather_failed:
 end_split_failed:
-	mas_set(&mas_detach, 0);
-	mas_for_each(&mas_detach, next, end)
-		vma_mark_detached(next, false);
-
-	__mt_destroy(&mt_detach);
+	abort_munmap_vmas(&mas_detach);
 start_split_failed:
 map_count_exceeded:
 	validate_mm(mm);
-- 
2.43.0




* [PATCH v5 03/21] mm/mmap: Introduce vmi_complete_munmap_vmas()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 01/21] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 02/21] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 04/21] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

Extract all of the operations that need to be completed after the vma
maple tree is updated from a munmap() operation.  Extracting this makes
the later patch in the series easier to understand.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 81 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 55 insertions(+), 26 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index babfa50f1411..bd3378935c70 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2602,6 +2602,58 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
 	__mt_destroy(mas_detach->tree);
 }
 
+/*
+ * vmi_complete_munmap_vmas() - Finish the munmap() operation
+ * @vmi: The vma iterator
+ * @vma: The first vma to be munmapped
+ * @mm: The mm struct
+ * @start: The start address
+ * @end: The end address
+ * @unlock: Unlock the mm or not
+ * @mas_detach: them maple state of the detached vma maple tree
+ * @locked_vm: The locked_vm count in the detached vmas
+ *
+ * This function updates the mm_struct, unmaps the region, frees the resources
+ * used for the munmap() and may downgrade the lock - if requested.  Everything
+ * needed to be done once the vma maple tree is updated.
+ */
+static void
+vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		struct mm_struct *mm, unsigned long start,
+		unsigned long end, bool unlock, struct ma_state *mas_detach,
+		unsigned long locked_vm)
+{
+	struct vm_area_struct *prev, *next;
+	int count;
+
+	count = mas_detach->index + 1;
+	mm->map_count -= count;
+	mm->locked_vm -= locked_vm;
+	if (unlock)
+		mmap_write_downgrade(mm);
+
+	prev = vma_iter_prev_range(vmi);
+	next = vma_next(vmi);
+	if (next)
+		vma_iter_prev_range(vmi);
+
+	/*
+	 * We can free page tables without write-locking mmap_lock because VMAs
+	 * were isolated before we downgraded mmap_lock.
+	 */
+	mas_set(mas_detach, 1);
+	unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
+		     !unlock);
+	/* Statistics and freeing VMAs */
+	mas_set(mas_detach, 0);
+	remove_mt(mm, mas_detach);
+	validate_mm(mm);
+	if (unlock)
+		mmap_read_unlock(mm);
+
+	__mt_destroy(mas_detach->tree);
+}
+
 /*
  * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
  * @vmi: The vma iterator
@@ -2621,7 +2673,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
 		    unsigned long end, struct list_head *uf, bool unlock)
 {
-	struct vm_area_struct *prev, *next = NULL;
+	struct vm_area_struct *next = NULL;
 	struct maple_tree mt_detach;
 	int count = 0;
 	int error = -ENOMEM;
@@ -2725,31 +2777,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto clear_tree_failed;
 
 	/* Point of no return */
-	mm->locked_vm -= locked_vm;
-	mm->map_count -= count;
-	if (unlock)
-		mmap_write_downgrade(mm);
-
-	prev = vma_iter_prev_range(vmi);
-	next = vma_next(vmi);
-	if (next)
-		vma_iter_prev_range(vmi);
-
-	/*
-	 * We can free page tables without write-locking mmap_lock because VMAs
-	 * were isolated before we downgraded mmap_lock.
-	 */
-	mas_set(&mas_detach, 1);
-	unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
-		     !unlock);
-	/* Statistics and freeing VMAs */
-	mas_set(&mas_detach, 0);
-	remove_mt(mm, &mas_detach);
-	validate_mm(mm);
-	if (unlock)
-		mmap_read_unlock(mm);
-
-	__mt_destroy(&mt_detach);
+	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
+				 locked_vm);
 	return 0;
 
 clear_tree_failed:
-- 
2.43.0




* [PATCH v5 04/21] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (2 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 03/21] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 05/21] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
detached maple tree for removal later.  Part of the gathering is the
splitting of vmas that span the boundary.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 80 ++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 58 insertions(+), 22 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index bd3378935c70..0d03fcf2ac0b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2655,32 +2655,30 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 }
 
 /*
- * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * for removal at a later date.  Handles splitting first and last if necessary
+ * and marking the vmas as isolated.
+ *
  * @vmi: The vma iterator
  * @vma: The starting vm_area_struct
  * @mm: The mm_struct
  * @start: The aligned start address to munmap.
  * @end: The aligned end address to munmap.
  * @uf: The userfaultfd list_head
- * @unlock: Set to true to drop the mmap_lock.  unlocking only happens on
- * success.
+ * @mas_detach: The maple state tracking the detached tree
+ * @locked_vm: a pointer to store the VM_LOCKED pages count.
  *
- * Return: 0 on success and drops the lock if so directed, error and leaves the
- * lock held otherwise.
+ * Return: 0 on success
  */
 static int
-do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
-		    unsigned long end, struct list_head *uf, bool unlock)
+		    unsigned long end, struct list_head *uf,
+		    struct ma_state *mas_detach, unsigned long *locked_vm)
 {
 	struct vm_area_struct *next = NULL;
-	struct maple_tree mt_detach;
 	int count = 0;
 	int error = -ENOMEM;
-	unsigned long locked_vm = 0;
-	MA_STATE(mas_detach, &mt_detach, 0, 0);
-	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
-	mt_on_stack(mt_detach);
 
 	/*
 	 * If we need to split any vma, do it now to save pain later.
@@ -2719,15 +2717,15 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 				goto end_split_failed;
 		}
 		vma_start_write(next);
-		mas_set(&mas_detach, count);
-		error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
+		mas_set(mas_detach, count++);
+		error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
 		if (error)
 			goto munmap_gather_failed;
+
 		vma_mark_detached(next, true);
 		if (next->vm_flags & VM_LOCKED)
-			locked_vm += vma_pages(next);
+			*locked_vm += vma_pages(next);
 
-		count++;
 		if (unlikely(uf)) {
 			/*
 			 * If userfaultfd_unmap_prep returns an error the vmas
@@ -2752,7 +2750,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
 	{
-		MA_STATE(test, &mt_detach, 0, 0);
+		MA_STATE(test, mas_detach->tree, 0, 0);
 		struct vm_area_struct *vma_mas, *vma_test;
 		int test_count = 0;
 
@@ -2772,6 +2770,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	while (vma_iter_addr(vmi) > start)
 		vma_iter_prev_range(vmi);
 
+	return 0;
+
+userfaultfd_error:
+munmap_gather_failed:
+end_split_failed:
+	abort_munmap_vmas(mas_detach);
+start_split_failed:
+map_count_exceeded:
+	return error;
+}
+
+/*
+ * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * @vmi: The vma iterator
+ * @vma: The starting vm_area_struct
+ * @mm: The mm_struct
+ * @start: The aligned start address to munmap.
+ * @end: The aligned end address to munmap.
+ * @uf: The userfaultfd list_head
+ * @unlock: Set to true to drop the mmap_lock.  unlocking only happens on
+ * success.
+ *
+ * Return: 0 on success and drops the lock if so directed, error and leaves the
+ * lock held otherwise.
+ */
+static int
+do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		    struct mm_struct *mm, unsigned long start,
+		    unsigned long end, struct list_head *uf, bool unlock)
+{
+	struct maple_tree mt_detach;
+	MA_STATE(mas_detach, &mt_detach, 0, 0);
+	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+	mt_on_stack(mt_detach);
+	int error;
+	unsigned long locked_vm = 0;
+
+	error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
+				       &mas_detach, &locked_vm);
+	if (error)
+		goto gather_failed;
+
 	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
 	if (error)
 		goto clear_tree_failed;
@@ -2782,12 +2822,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return 0;
 
 clear_tree_failed:
-userfaultfd_error:
-munmap_gather_failed:
-end_split_failed:
 	abort_munmap_vmas(&mas_detach);
-start_split_failed:
-map_count_exceeded:
+gather_failed:
 	validate_mm(mm);
 	return error;
 }
-- 
2.43.0




* [PATCH v5 05/21] mm/mmap: Introduce vma_munmap_struct for use in munmap operations
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (3 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 04/21] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 06/21] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

Use a structure to pass along all the necessary information and counters
involved in removing vmas from the mm_struct.

Update the vmi_ function names to vms_ to indicate the change in the
type of the first argument.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/internal.h |  16 ++++++
 mm/mmap.c     | 138 ++++++++++++++++++++++++++------------------------
 2 files changed, 89 insertions(+), 65 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b4d86436565b..9d0f4d1bebba 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1481,6 +1481,22 @@ struct vma_prepare {
 	struct vm_area_struct *remove2;
 };
 
+/*
+ * vma munmap operation
+ */
+struct vma_munmap_struct {
+	struct vma_iterator *vmi;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;	/* The first vma to munmap */
+	struct list_head *uf;		/* Userfaultfd list_head */
+	unsigned long start;		/* Aligned start addr (inclusive) */
+	unsigned long end;		/* Aligned end addr (exclusive) */
+	int vma_count;			/* Number of vmas that will be removed */
+	unsigned long nr_pages;		/* Number of pages being removed */
+	unsigned long locked_vm;	/* Number of locked pages */
+	bool unlock;			/* Unlock after the munmap */
+};
+
 void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 0d03fcf2ac0b..1ed0720c38c5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -500,6 +500,31 @@ static inline void init_vma_prep(struct vma_prepare *vp,
 	init_multi_vma_prep(vp, vma, NULL, NULL, NULL);
 }
 
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation.  Only unlocked on success
+ */
+static inline void init_vma_munmap(struct vma_munmap_struct *vms,
+		struct vma_iterator *vmi, struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, struct list_head *uf,
+		bool unlock)
+{
+	vms->vmi = vmi;
+	vms->vma = vma;
+	vms->mm = vma->vm_mm;
+	vms->start = start;
+	vms->end = end;
+	vms->unlock = unlock;
+	vms->uf = uf;
+	vms->vma_count = 0;
+	vms->nr_pages = vms->locked_vm = 0;
+}
 
 /*
  * vma_prepare() - Helper function for handling locking VMAs prior to altering
@@ -2603,81 +2628,63 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
 }
 
 /*
- * vmi_complete_munmap_vmas() - Finish the munmap() operation
- * @vmi: The vma iterator
- * @vma: The first vma to be munmapped
- * @mm: The mm struct
- * @start: The start address
- * @end: The end address
- * @unlock: Unlock the mm or not
- * @mas_detach: them maple state of the detached vma maple tree
- * @locked_vm: The locked_vm count in the detached vmas
+ * vms_complete_munmap_vmas() - Finish the munmap() operation
+ * @vms: The vma munmap struct
+ * @mas_detach: The maple state of the detached vmas
  *
- * This function updates the mm_struct, unmaps the region, frees the resources
+ * This updates the mm_struct, unmaps the region, frees the resources
  * used for the munmap() and may downgrade the lock - if requested.  Everything
  * needed to be done once the vma maple tree is updated.
  */
-static void
-vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		struct mm_struct *mm, unsigned long start,
-		unsigned long end, bool unlock, struct ma_state *mas_detach,
-		unsigned long locked_vm)
+
+static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *prev, *next;
-	int count;
+	struct mm_struct *mm;
 
-	count = mas_detach->index + 1;
-	mm->map_count -= count;
-	mm->locked_vm -= locked_vm;
-	if (unlock)
+	mm = vms->mm;
+	mm->map_count -= vms->vma_count;
+	mm->locked_vm -= vms->locked_vm;
+	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	prev = vma_iter_prev_range(vmi);
-	next = vma_next(vmi);
+	prev = vma_iter_prev_range(vms->vmi);
+	next = vma_next(vms->vmi);
 	if (next)
-		vma_iter_prev_range(vmi);
+		vma_iter_prev_range(vms->vmi);
 
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
 	 */
 	mas_set(mas_detach, 1);
-	unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
-		     !unlock);
+	unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
+		     vms->vma_count, !vms->unlock);
 	/* Statistics and freeing VMAs */
 	mas_set(mas_detach, 0);
 	remove_mt(mm, mas_detach);
 	validate_mm(mm);
-	if (unlock)
+	if (vms->unlock)
 		mmap_read_unlock(mm);
 
 	__mt_destroy(mas_detach->tree);
 }
 
 /*
- * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
  * for removal at a later date.  Handles splitting first and last if necessary
  * and marking the vmas as isolated.
  *
- * @vmi: The vma iterator
- * @vma: The starting vm_area_struct
- * @mm: The mm_struct
- * @start: The aligned start address to munmap.
- * @end: The aligned end address to munmap.
- * @uf: The userfaultfd list_head
+ * @vms: The vma munmap struct
  * @mas_detach: The maple state tracking the detached tree
- * @locked_vm: a pointer to store the VM_LOCKED pages count.
  *
  * Return: 0 on success
  */
-static int
-vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		    struct mm_struct *mm, unsigned long start,
-		    unsigned long end, struct list_head *uf,
-		    struct ma_state *mas_detach, unsigned long *locked_vm)
+static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *next = NULL;
-	int count = 0;
 	int error = -ENOMEM;
 
 	/*
@@ -2689,17 +2696,18 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 */
 
 	/* Does it split the first one? */
-	if (start > vma->vm_start) {
+	if (vms->start > vms->vma->vm_start) {
 
 		/*
 		 * Make sure that map_count on return from munmap() will
 		 * not exceed its limit; but let map_count go just above
 		 * its limit temporarily, to help free resources as expected.
 		 */
-		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+		if (vms->end < vms->vma->vm_end &&
+		    vms->mm->map_count >= sysctl_max_map_count)
 			goto map_count_exceeded;
 
-		error = __split_vma(vmi, vma, start, 1);
+		error = __split_vma(vms->vmi, vms->vma, vms->start, 1);
 		if (error)
 			goto start_split_failed;
 	}
@@ -2708,25 +2716,25 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
 	 * it is always overwritten.
 	 */
-	next = vma;
+	next = vms->vma;
 	do {
 		/* Does it split the end? */
-		if (next->vm_end > end) {
-			error = __split_vma(vmi, next, end, 0);
+		if (next->vm_end > vms->end) {
+			error = __split_vma(vms->vmi, next, vms->end, 0);
 			if (error)
 				goto end_split_failed;
 		}
 		vma_start_write(next);
-		mas_set(mas_detach, count++);
+		mas_set(mas_detach, vms->vma_count++);
 		error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
 		if (error)
 			goto munmap_gather_failed;
 
 		vma_mark_detached(next, true);
 		if (next->vm_flags & VM_LOCKED)
-			*locked_vm += vma_pages(next);
+			vms->locked_vm += vma_pages(next);
 
-		if (unlikely(uf)) {
+		if (unlikely(vms->uf)) {
 			/*
 			 * If userfaultfd_unmap_prep returns an error the vmas
 			 * will remain split, but userland will get a
@@ -2736,16 +2744,17 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			 * split, despite we could. This is unlikely enough
 			 * failure that it's not worth optimizing it for.
 			 */
-			error = userfaultfd_unmap_prep(next, start, end, uf);
+			error = userfaultfd_unmap_prep(next, vms->start,
+						       vms->end, vms->uf);
 
 			if (error)
 				goto userfaultfd_error;
 		}
 #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
-		BUG_ON(next->vm_start < start);
-		BUG_ON(next->vm_start > end);
+		BUG_ON(next->vm_start < vms->start);
+		BUG_ON(next->vm_start > vms->end);
 #endif
-	} for_each_vma_range(*vmi, next, end);
+	} for_each_vma_range(*(vms->vmi), next, vms->end);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
@@ -2754,21 +2763,21 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct *vma_mas, *vma_test;
 		int test_count = 0;
 
-		vma_iter_set(vmi, start);
+		vma_iter_set(vms->vmi, vms->start);
 		rcu_read_lock();
-		vma_test = mas_find(&test, count - 1);
-		for_each_vma_range(*vmi, vma_mas, end) {
+		vma_test = mas_find(&test, vms->vma_count - 1);
+		for_each_vma_range(*(vms->vmi), vma_mas, vms->end) {
 			BUG_ON(vma_mas != vma_test);
 			test_count++;
-			vma_test = mas_next(&test, count - 1);
+			vma_test = mas_next(&test, vms->vma_count - 1);
 		}
 		rcu_read_unlock();
-		BUG_ON(count != test_count);
+		BUG_ON(vms->vma_count != test_count);
 	}
 #endif
 
-	while (vma_iter_addr(vmi) > start)
-		vma_iter_prev_range(vmi);
+	while (vma_iter_addr(vms->vmi) > vms->start)
+		vma_iter_prev_range(vms->vmi);
 
 	return 0;
 
@@ -2804,11 +2813,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	MA_STATE(mas_detach, &mt_detach, 0, 0);
 	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
 	mt_on_stack(mt_detach);
+	struct vma_munmap_struct vms;
 	int error;
-	unsigned long locked_vm = 0;
 
-	error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
-				       &mas_detach, &locked_vm);
+	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
+	error = vms_gather_munmap_vmas(&vms, &mas_detach);
 	if (error)
 		goto gather_failed;
 
@@ -2817,8 +2826,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto clear_tree_failed;
 
 	/* Point of no return */
-	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
-				 locked_vm);
+	vms_complete_munmap_vmas(&vms, &mas_detach);
 	return 0;
 
 clear_tree_failed:
-- 
2.43.0




* [PATCH v5 06/21] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (4 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 05/21] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 07/21] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

Clean up the code by changing the munmap operation to use a structure
for the accounting and munmap variables.

Since remove_mt() is only called in one location and its contents will
be reduced to almost nothing, the remains of the function can be folded
into vms_complete_munmap_vmas().

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/internal.h |  6 ++++
 mm/mmap.c     | 80 +++++++++++++++++++++++++--------------------------
 2 files changed, 46 insertions(+), 40 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9d0f4d1bebba..02627e269d6b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1488,12 +1488,18 @@ struct vma_munmap_struct {
 	struct vma_iterator *vmi;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;	/* The first vma to munmap */
+	struct vm_area_struct *prev;    /* vma before the munmap area */
+	struct vm_area_struct *next;	/* vma after the munmap area */
 	struct list_head *uf;		/* Userfaultfd list_head */
 	unsigned long start;		/* Aligned start addr (inclusive) */
 	unsigned long end;		/* Aligned end addr (exclusive) */
 	int vma_count;			/* Number of vmas that will be removed */
 	unsigned long nr_pages;		/* Number of pages being removed */
 	unsigned long locked_vm;	/* Number of locked pages */
+	unsigned long nr_accounted;	/* Number of VM_ACCOUNT pages */
+	unsigned long exec_vm;
+	unsigned long stack_vm;
+	unsigned long data_vm;
 	bool unlock;			/* Unlock after the munmap */
 };
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 1ed0720c38c5..62ff7aa10004 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -523,7 +523,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->unlock = unlock;
 	vms->uf = uf;
 	vms->vma_count = 0;
-	vms->nr_pages = vms->locked_vm = 0;
+	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
 }
 
 /*
@@ -2388,30 +2389,6 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
 	return vma;
 }
 
-/*
- * Ok - we have the memory areas we should free on a maple tree so release them,
- * and do the vma updates.
- *
- * Called with the mm semaphore held.
- */
-static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
-{
-	unsigned long nr_accounted = 0;
-	struct vm_area_struct *vma;
-
-	/* Update high watermark before we lower total_vm */
-	update_hiwater_vm(mm);
-	mas_for_each(mas, vma, ULONG_MAX) {
-		long nrpages = vma_pages(vma);
-
-		if (vma->vm_flags & VM_ACCOUNT)
-			nr_accounted += nrpages;
-		vm_stat_account(mm, vma->vm_flags, -nrpages);
-		remove_vma(vma, false);
-	}
-	vm_unacct_memory(nr_accounted);
-}
-
 /*
  * Get rid of page table information in the indicated region.
  *
@@ -2632,15 +2609,14 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
  * @vms: The vma munmap struct
  * @mas_detach: The maple state of the detached vmas
  *
- * This updates the mm_struct, unmaps the region, frees the resources
+ * This function updates the mm_struct, unmaps the region, frees the resources
  * used for the munmap() and may downgrade the lock - if requested.  Everything
  * needed to be done once the vma maple tree is updated.
  */
-
 static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach)
 {
-	struct vm_area_struct *prev, *next;
+	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 
 	mm = vms->mm;
@@ -2649,21 +2625,26 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	prev = vma_iter_prev_range(vms->vmi);
-	next = vma_next(vms->vmi);
-	if (next)
-		vma_iter_prev_range(vms->vmi);
-
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
 	 */
 	mas_set(mas_detach, 1);
-	unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
-		     vms->vma_count, !vms->unlock);
-	/* Statistics and freeing VMAs */
+	unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
+		     vms->start, vms->end, vms->vma_count, !vms->unlock);
+	/* Update high watermark before we lower total_vm */
+	update_hiwater_vm(mm);
+	/* Stat accounting */
+	WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
+	mm->exec_vm -= vms->exec_vm;
+	mm->stack_vm -= vms->stack_vm;
+	mm->data_vm -= vms->data_vm;
+	/* Remove and clean up vmas */
 	mas_set(mas_detach, 0);
-	remove_mt(mm, mas_detach);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		remove_vma(vma, false);
+
+	vm_unacct_memory(vms->nr_accounted);
 	validate_mm(mm);
 	if (vms->unlock)
 		mmap_read_unlock(mm);
@@ -2711,13 +2692,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		if (error)
 			goto start_split_failed;
 	}
+	vms->prev = vma_prev(vms->vmi);
 
 	/*
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
 	 * it is always overwritten.
 	 */
-	next = vms->vma;
-	do {
+	for_each_vma_range(*(vms->vmi), next, vms->end) {
+		long nrpages;
 		/* Does it split the end? */
 		if (next->vm_end > vms->end) {
 			error = __split_vma(vms->vmi, next, vms->end, 0);
@@ -2731,6 +2713,22 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 			goto munmap_gather_failed;
 
 		vma_mark_detached(next, true);
+		nrpages = vma_pages(next);
+
+		vms->nr_pages += nrpages;
+		if (next->vm_flags & VM_LOCKED)
+			vms->locked_vm += nrpages;
+
+		if (next->vm_flags & VM_ACCOUNT)
+			vms->nr_accounted += nrpages;
+
+		if (is_exec_mapping(next->vm_flags))
+			vms->exec_vm += nrpages;
+		else if (is_stack_mapping(next->vm_flags))
+			vms->stack_vm += nrpages;
+		else if (is_data_mapping(next->vm_flags))
+			vms->data_vm += nrpages;
+
 		if (next->vm_flags & VM_LOCKED)
 			vms->locked_vm += vma_pages(next);
 
@@ -2754,7 +2752,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		BUG_ON(next->vm_start < vms->start);
 		BUG_ON(next->vm_start > vms->end);
 #endif
-	} for_each_vma_range(*(vms->vmi), next, vms->end);
+	}
+
+	vms->next = vma_next(vms->vmi);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
-- 
2.43.0




* [PATCH v5 07/21] mm/mmap: Extract validate_mm() from vma_complete()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (5 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 06/21] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 08/21] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

vma_complete() will need to be called at a time when it is unsafe to
call validate_mm().  Extract the validate_mm() call to all callers now
so that only one location needs to be modified in the next change.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 62ff7aa10004..1c9016fb6b5c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -646,7 +646,6 @@ static inline void vma_complete(struct vma_prepare *vp,
 	}
 	if (vp->insert && vp->file)
 		uprobe_mmap(vp->insert);
-	validate_mm(mm);
 }
 
 /*
@@ -734,6 +733,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_store(vmi, vma);
 
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 	return 0;
 
 nomem:
@@ -775,6 +775,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_clear(vmi);
 	vma_set_range(vma, start, end, pgoff);
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 	return 0;
 }
 
@@ -1103,6 +1104,7 @@ static struct vm_area_struct
 	}
 
 	vma_complete(&vp, vmi, mm);
+	validate_mm(mm);
 	khugepaged_enter_vma(res, vm_flags);
 	return res;
 
@@ -2481,6 +2483,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	/* vma_complete stores the new vma */
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 
 	/* Success. */
 	if (new_below)
@@ -3354,6 +3357,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		vma_iter_store(vmi, vma);
 
 		vma_complete(&vp, vmi, mm);
+		validate_mm(mm);
 		khugepaged_enter_vma(vma, flags);
 		goto out;
 	}
-- 
2.43.0




* [PATCH v5 08/21] mm/mmap: Inline munmap operation in mmap_region()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (6 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 07/21] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 09/21] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

mmap_region() is already passed a sanitized addr and len, so change the
do_vmi_munmap() call to do_vmi_align_munmap() and inline the other
checks.

Inlining the function and checks is an intermediate step in the series
that makes future patches easier to follow.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 1c9016fb6b5c..49b3ab406353 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2938,12 +2938,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			return -ENOMEM;
 	}
 
-	/* Unmap any existing mapping in the area */
-	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
-	if (error == -EPERM)
-		return error;
-	else if (error)
-		return -ENOMEM;
+
+	if (unlikely(!can_modify_mm(mm, addr, end)))
+		return -EPERM;
+
+	 /* arch_unmap() might do unmaps itself.  */
+	arch_unmap(mm, addr, end);
+
+	/* Find the first overlapping VMA */
+	vma = vma_find(&vmi, end);
+	if (vma) {
+		/* Unmap any existing mapping in the area */
+		if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+			return -ENOMEM;
+		vma = NULL;
+	}
 
 	/*
 	 * Private writable mapping: check memory availability
-- 
2.43.0




* [PATCH v5 09/21] mm/mmap: Expand mmap_region() munmap call
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (7 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 08/21] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:06 ` [PATCH v5 10/21] mm/mmap: Support vma == NULL in init_vma_munmap() Liam R. Howlett
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Open code the do_vmi_align_munmap() call so that it can be broken up
later in the series.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 49b3ab406353..a1544a68558e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2916,6 +2916,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct vm_area_struct *next, *prev, *merge;
 	pgoff_t pglen = len >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	struct vma_munmap_struct vms;
+	struct ma_state mas_detach;
+	struct maple_tree mt_detach;
 	unsigned long end = addr + len;
 	unsigned long merge_start = addr, merge_end = end;
 	bool writable_file_mapping = false;
@@ -2948,10 +2951,27 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
 	if (vma) {
-		/* Unmap any existing mapping in the area */
-		if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+		mt_on_stack(mt_detach);
+		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+		init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
+		/* Prepare to unmap any existing mapping in the area */
+		if (vms_gather_munmap_vmas(&vms, &mas_detach))
+			return -ENOMEM;
+
+		/* Remove any existing mappings from the vma tree */
+		if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
 			return -ENOMEM;
+
+		/* Unmap any existing mapping in the area */
+		vms_complete_munmap_vmas(&vms, &mas_detach);
+		next = vms.next;
+		prev = vms.prev;
+		vma_prev(&vmi);
 		vma = NULL;
+	} else {
+		next = vma_next(&vmi);
+		prev = vma_prev(&vmi);
 	}
 
 	/*
@@ -2964,8 +2984,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
-	next = vma_next(&vmi);
-	prev = vma_prev(&vmi);
 	if (vm_flags & VM_SPECIAL) {
 		if (prev)
 			vma_iter_next_range(&vmi);
-- 
2.43.0




* [PATCH v5 10/21] mm/mmap: Support vma == NULL in init_vma_munmap()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (8 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 09/21] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-22 13:14   ` Lorenzo Stoakes
  2024-07-17 20:06 ` [PATCH v5 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Adding support for a NULL vma means init_vma_munmap() can always be
called to set up the vma_munmap_struct, which makes the later call to
vms_complete_munmap_vmas() less error prone.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index a1544a68558e..722bcced0499 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -517,9 +517,14 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 {
 	vms->vmi = vmi;
 	vms->vma = vma;
-	vms->mm = vma->vm_mm;
-	vms->start = start;
-	vms->end = end;
+	if (vma) {
+		vms->mm = vma->vm_mm;
+		vms->start = start;
+		vms->end = end;
+	} else {
+		vms->mm = NULL;
+		vms->start = vms->end = 0;
+	}
 	vms->unlock = unlock;
 	vms->uf = uf;
 	vms->vma_count = 0;
@@ -2950,11 +2955,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
+	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
 	if (vma) {
 		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
 		mt_on_stack(mt_detach);
 		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
-		init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
 		/* Prepare to unmap any existing mapping in the area */
 		if (vms_gather_munmap_vmas(&vms, &mas_detach))
 			return -ENOMEM;
-- 
2.43.0




* [PATCH v5 11/21] mm/mmap: Reposition vma iterator in mmap_region()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (9 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 10/21] mm/mmap: Support vma == NULL in init_vma_munmap() Liam R. Howlett
@ 2024-07-17 20:06 ` Liam R. Howlett
  2024-07-17 20:07 ` [PATCH v5 12/21] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:06 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of moving (or leaving) the vma iterator pointing at the previous
vma, leave it pointing at the insert location.  Pointing the vma
iterator at the insert location allows for a cleaner walk of the vma
tree for MAP_FIXED and for the cases where no expansion occurs.

The vma_prev() call in the case of merging the previous vma is
equivalent to vma_iter_prev_range(), since the vma iterator will be
pointing to the location just before the previous vma.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 722bcced0499..b940de8c6df8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2962,21 +2962,22 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
 		/* Prepare to unmap any existing mapping in the area */
 		if (vms_gather_munmap_vmas(&vms, &mas_detach))
-			return -ENOMEM;
+			goto gather_failed;
 
 		/* Remove any existing mappings from the vma tree */
 		if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
-			return -ENOMEM;
+			goto clear_tree_failed;
 
 		/* Unmap any existing mapping in the area */
 		vms_complete_munmap_vmas(&vms, &mas_detach);
 		next = vms.next;
 		prev = vms.prev;
-		vma_prev(&vmi);
 		vma = NULL;
 	} else {
 		next = vma_next(&vmi);
 		prev = vma_prev(&vmi);
+		if (prev)
+			vma_iter_next_range(&vmi);
 	}
 
 	/*
@@ -2989,11 +2990,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
-	if (vm_flags & VM_SPECIAL) {
-		if (prev)
-			vma_iter_next_range(&vmi);
+	if (vm_flags & VM_SPECIAL)
 		goto cannot_expand;
-	}
 
 	/* Attempt to expand an old mapping */
 	/* Check next */
@@ -3014,19 +3012,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		merge_start = prev->vm_start;
 		vma = prev;
 		vm_pgoff = prev->vm_pgoff;
-	} else if (prev) {
-		vma_iter_next_range(&vmi);
+		vma_prev(&vmi); /* Equivalent to going to the previous range */
 	}
 
-	/* Actually expand, if possible */
-	if (vma &&
-	    !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
-		khugepaged_enter_vma(vma, vm_flags);
-		goto expanded;
+	if (vma) {
+		/* Actually expand, if possible */
+		if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
+			khugepaged_enter_vma(vma, vm_flags);
+			goto expanded;
+		}
+
+		/* If the expand fails, then reposition the vma iterator */
+		if (unlikely(vma == prev))
+			vma_iter_set(&vmi, addr);
 	}
 
-	if (vma == prev)
-		vma_iter_set(&vmi, addr);
 cannot_expand:
 
 	/*
@@ -3187,6 +3187,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_unacct_memory(charged);
 	validate_mm(mm);
 	return error;
+
+clear_tree_failed:
+	abort_munmap_vmas(&mas_detach);
+gather_failed:
+	validate_mm(mm);
+	return -ENOMEM;
 }
 
 static int __vm_munmap(unsigned long start, size_t len, bool unlock)
-- 
2.43.0




* [PATCH v5 12/21] mm/mmap: Track start and end of munmap in vma_munmap_struct
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (10 preceding siblings ...)
  2024-07-17 20:06 ` [PATCH v5 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-17 20:07 ` [PATCH v5 13/21] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Set the start and end addresses for the munmap when prev and next are
gathered.  This is needed to avoid using incorrect addresses in
vms_complete_munmap_vmas() if the prev or next vma is expanded.

Add a new helper, vms_complete_pte_clear(), which is needed later and
avoids growing the argument list of unmap_region() beyond the nine
arguments it already has.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/internal.h |  2 ++
 mm/mmap.c     | 37 ++++++++++++++++++++++++++++++-------
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 02627e269d6b..ec8441362c28 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1493,6 +1493,8 @@ struct vma_munmap_struct {
 	struct list_head *uf;		/* Userfaultfd list_head */
 	unsigned long start;		/* Aligned start addr (inclusive) */
 	unsigned long end;		/* Aligned end addr (exclusive) */
+	unsigned long unmap_start;	/* Unmap PTE start */
+	unsigned long unmap_end;	/* Unmap PTE end */
 	int vma_count;			/* Number of vmas that will be removed */
 	unsigned long nr_pages;		/* Number of pages being removed */
 	unsigned long locked_vm;	/* Number of locked pages */
diff --git a/mm/mmap.c b/mm/mmap.c
index b940de8c6df8..7cc1f47122f6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -530,6 +530,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->vma_count = 0;
 	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
 	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+	vms->unmap_start = FIRST_USER_ADDRESS;
+	vms->unmap_end = USER_PGTABLES_CEILING;
 }
 
 /*
@@ -2612,6 +2614,29 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
 	__mt_destroy(mas_detach->tree);
 }
 
+
+static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach, bool mm_wr_locked)
+{
+	struct mmu_gather tlb;
+
+	/*
+	 * We can free page tables without write-locking mmap_lock because VMAs
+	 * were isolated before we downgraded mmap_lock.
+	 */
+	mas_set(mas_detach, 1);
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, vms->mm);
+	update_hiwater_rss(vms->mm);
+	unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end,
+		   vms->vma_count, mm_wr_locked);
+	mas_set(mas_detach, 1);
+	/* start and end may be different if there is no prev or next vma. */
+	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
+		      vms->unmap_end, mm_wr_locked);
+	tlb_finish_mmu(&tlb);
+}
+
 /*
  * vms_complete_munmap_vmas() - Finish the munmap() operation
  * @vms: The vma munmap struct
@@ -2633,13 +2658,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	/*
-	 * We can free page tables without write-locking mmap_lock because VMAs
-	 * were isolated before we downgraded mmap_lock.
-	 */
-	mas_set(mas_detach, 1);
-	unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
-		     vms->start, vms->end, vms->vma_count, !vms->unlock);
+	vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
 	/* Update high watermark before we lower total_vm */
 	update_hiwater_vm(mm);
 	/* Stat accounting */
@@ -2701,6 +2720,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 			goto start_split_failed;
 	}
 	vms->prev = vma_prev(vms->vmi);
+	if (vms->prev)
+		vms->unmap_start = vms->prev->vm_end;
 
 	/*
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
@@ -2763,6 +2784,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	}
 
 	vms->next = vma_next(vms->vmi);
+	if (vms->next)
+		vms->unmap_end = vms->next->vm_start;
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 13/21] mm/mmap: Clean up unmap_region() argument list
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (11 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 12/21] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-17 20:07 ` [PATCH v5 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

With the only caller to unmap_region() being the error path of
mmap_region(), the argument list can be significantly reduced.

There is also no need to forward declare the static function any
longer.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 7cc1f47122f6..20da0d039c95 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -76,11 +76,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
 static bool ignore_rlimit_data;
 core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 
-static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
-		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		struct vm_area_struct *next, unsigned long start,
-		unsigned long end, unsigned long tree_end, bool mm_wr_locked);
-
 static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
 {
 	return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
@@ -2403,22 +2398,21 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
  *
  * Called with the mm semaphore held.
  */
-static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
-		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		struct vm_area_struct *next, unsigned long start,
-		unsigned long end, unsigned long tree_end, bool mm_wr_locked)
+static void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+		struct vm_area_struct *prev, struct vm_area_struct *next)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_gather tlb;
-	unsigned long mt_start = mas->index;
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
-	mas_set(mas, mt_start);
+	unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
+		   /* mm_wr_locked = */ true);
+	mas_set(mas, vma->vm_end);
 	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
-				 next ? next->vm_start : USER_PGTABLES_CEILING,
-				 mm_wr_locked);
+		      next ? next->vm_start : USER_PGTABLES_CEILING,
+		      /* mm_wr_locked = */ true);
 	tlb_finish_mmu(&tlb);
 }
 
@@ -3198,8 +3192,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 		vma_iter_set(&vmi, vma->vm_end);
 		/* Undo any partial mapping done by a device driver. */
-		unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
-			     vma->vm_end, vma->vm_end, true);
+		unmap_region(&vmi.mas, vma, prev, next);
 	}
 	if (writable_file_mapping)
 		mapping_unmap_writable(file->f_mapping);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (12 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 13/21] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-22 18:42   ` Lorenzo Stoakes
  2024-07-17 20:07 ` [PATCH v5 15/21] mm/mmap: Use PHYS_PFN " Liam R. Howlett
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of zeroing the vma tree and then overwriting the area, let the
area be overwritten and then clean up the gathered vmas using
vms_complete_munmap_vmas().

If a driver is mapping over an existing vma, then clear the ptes before
the call_mmap() invocation.  If the vma has a vm_ops->close(), then call
the close() function.  This is done using the vms_clear_ptes() and
vms_close_vmas() helpers.  This has the side effect of needing to call
open() on the vmas if the mmap_region() fails later on.

Temporarily keep track of the number of pages that will be removed and
reduce the charged amount.

This commit drops the validate_mm() call in the vma_expand() function.
It is necessary to drop the validate as it would fail since the mm
map_count would be incorrect during a vma expansion, prior to the
cleanup from vms_complete_munmap_vmas().

Clean up the error handling of vms_gather_munmap_vmas() by calling
validate_mm() within the function on the error paths.

Note that before this change, a MAP_FIXED could fail and leave a gap in
the vma tree.  With this change, a MAP_FIXED failure will not create a
gap: *a* vma is left in the area (it may have been split) and an attempt
is made to restore the vmas to an operational state (re-attached and
vm_ops->open()'ed if they were vm_ops->close()'d).

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/internal.h |   2 +
 mm/mmap.c     | 119 +++++++++++++++++++++++++++++++-------------------
 2 files changed, 76 insertions(+), 45 deletions(-)
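
For reference, a minimal userspace demonstration of the semantic this
reordering preserves (the sizes and addresses below are arbitrary):
MAP_FIXED over part of an existing mapping replaces the old pages in one
operation, splitting the old vma around the new one, and a reader of the
address space should never observe an unmapped hole in between.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;
	char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (base == MAP_FAILED)
		return 1;
	memset(base, 0xaa, len);

	/* Map over the middle of the existing vma. */
	char *mid = mmap(base + len / 4, len / 2, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (mid == MAP_FAILED)
		return 1;

	printf("old vma split around the new mapping at %p\n", (void *)mid);
	return 0;
}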

diff --git a/mm/internal.h b/mm/internal.h
index ec8441362c28..5bd60cb9fcbb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1503,6 +1503,8 @@ struct vma_munmap_struct {
 	unsigned long stack_vm;
 	unsigned long data_vm;
 	bool unlock;			/* Unlock after the munmap */
+	bool clear_ptes;		/* If there are outstanding PTE to be cleared */
+	bool closed;			/* vma->vm_ops->close() called already */
 };
 
 void __meminit __init_single_page(struct page *page, unsigned long pfn,
diff --git a/mm/mmap.c b/mm/mmap.c
index 20da0d039c95..0b7aa2c46cec 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -170,10 +170,11 @@ void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
 /*
  * Close a vm structure and free it.
  */
-static void remove_vma(struct vm_area_struct *vma, bool unreachable)
+static
+void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
 {
 	might_sleep();
-	if (vma->vm_ops && vma->vm_ops->close)
+	if (!closed && vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
@@ -401,17 +402,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
 }
 
 static unsigned long count_vma_pages_range(struct mm_struct *mm,
-		unsigned long addr, unsigned long end)
+		unsigned long addr, unsigned long end,
+		unsigned long *nr_accounted)
 {
 	VMA_ITERATOR(vmi, mm, addr);
 	struct vm_area_struct *vma;
 	unsigned long nr_pages = 0;
 
+	*nr_accounted = 0;
 	for_each_vma_range(vmi, vma, end) {
 		unsigned long vm_start = max(addr, vma->vm_start);
 		unsigned long vm_end = min(end, vma->vm_end);
 
 		nr_pages += PHYS_PFN(vm_end - vm_start);
+		if (vma->vm_flags & VM_ACCOUNT)
+			*nr_accounted += PHYS_PFN(vm_end - vm_start);
 	}
 
 	return nr_pages;
@@ -527,6 +532,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
 	vms->unmap_start = FIRST_USER_ADDRESS;
 	vms->unmap_end = USER_PGTABLES_CEILING;
+	vms->clear_ptes = false;	/* No PTEs to clear yet */
+	vms->closed = false;
 }
 
 /*
@@ -735,7 +742,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_store(vmi, vma);
 
 	vma_complete(&vp, vmi, vma->vm_mm);
-	validate_mm(vma->vm_mm);
 	return 0;
 
 nomem:
@@ -2597,23 +2603,31 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
  *
  * Reattach any detached vmas and free up the maple tree used to track the vmas.
  */
-static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
 {
 	struct vm_area_struct *vma;
 
 	mas_set(mas_detach, 0);
-	mas_for_each(mas_detach, vma, ULONG_MAX)
+	mas_for_each(mas_detach, vma, ULONG_MAX) {
+		if (closed && vma->vm_ops && vma->vm_ops->close &&
+		    vma->vm_ops->open)
+			vma->vm_ops->open(vma);
+
 		vma_mark_detached(vma, false);
+	}
 
 	__mt_destroy(mas_detach->tree);
 }
 
 
-static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
+static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach, bool mm_wr_locked)
 {
 	struct mmu_gather tlb;
 
+	if (!vms->clear_ptes)
+		return;
+
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
@@ -2629,6 +2643,23 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
 	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
 		      vms->unmap_end, mm_wr_locked);
 	tlb_finish_mmu(&tlb);
+	vms->clear_ptes = false;
+}
+
+static inline void
+vms_close_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach)
+{
+	struct vm_area_struct *vma;
+
+	if (!vms->vma_count)
+		return;
+
+	mas_set(mas_detach, 0);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		if (vma->vm_ops && vma->vm_ops->close)
+			vma->vm_ops->close(vma);
+
+	vms->closed = true;
 }
 
 /*
@@ -2652,7 +2683,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
+	vms_clear_ptes(vms, mas_detach, !vms->unlock);
 	/* Update high watermark before we lower total_vm */
 	update_hiwater_vm(mm);
 	/* Stat accounting */
@@ -2663,7 +2694,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	/* Remove and clean up vmas */
 	mas_set(mas_detach, 0);
 	mas_for_each(mas_detach, vma, ULONG_MAX)
-		remove_vma(vma, false);
+		remove_vma(vma, /* unreachable = */ false, vms->closed);
 
 	vm_unacct_memory(vms->nr_accounted);
 	validate_mm(mm);
@@ -2804,14 +2835,18 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	while (vma_iter_addr(vms->vmi) > vms->start)
 		vma_iter_prev_range(vms->vmi);
 
+	/* There are now PTEs that need to be cleared */
+	vms->clear_ptes = true;
+
 	return 0;
 
 userfaultfd_error:
 munmap_gather_failed:
 end_split_failed:
-	abort_munmap_vmas(mas_detach);
+	abort_munmap_vmas(mas_detach, /* closed = */ false);
 start_split_failed:
 map_count_exceeded:
+	validate_mm(vms->mm);
 	return error;
 }
 
@@ -2855,9 +2890,9 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return 0;
 
 clear_tree_failed:
-	abort_munmap_vmas(&mas_detach);
-gather_failed:
+	abort_munmap_vmas(&mas_detach, /* closed = */ false);
 	validate_mm(mm);
+gather_failed:
 	return error;
 }
 
@@ -2945,24 +2980,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long merge_start = addr, merge_end = end;
 	bool writable_file_mapping = false;
 	pgoff_t vm_pgoff;
-	int error;
+	int error = -ENOMEM;
 	VMA_ITERATOR(vmi, mm, addr);
+	unsigned long nr_pages, nr_accounted;
 
-	/* Check against address space limit. */
-	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
-		unsigned long nr_pages;
-
-		/*
-		 * MAP_FIXED may remove pages of mappings that intersects with
-		 * requested mapping. Account for the pages it would unmap.
-		 */
-		nr_pages = count_vma_pages_range(mm, addr, end);
-
-		if (!may_expand_vm(mm, vm_flags,
-					(len >> PAGE_SHIFT) - nr_pages))
-			return -ENOMEM;
-	}
+	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
 
+	/*
+	 * Check against address space limit.
+	 * MAP_FIXED may remove pages of mappings that intersects with requested
+	 * mapping. Account for the pages it would unmap.
+	 */
+	if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+		return -ENOMEM;
 
 	if (unlikely(!can_modify_mm(mm, addr, end)))
 		return -EPERM;
@@ -2979,14 +3009,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
 		/* Prepare to unmap any existing mapping in the area */
 		if (vms_gather_munmap_vmas(&vms, &mas_detach))
-			goto gather_failed;
-
-		/* Remove any existing mappings from the vma tree */
-		if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
-			goto clear_tree_failed;
+			return -ENOMEM;
 
-		/* Unmap any existing mapping in the area */
-		vms_complete_munmap_vmas(&vms, &mas_detach);
 		next = vms.next;
 		prev = vms.prev;
 		vma = NULL;
@@ -3002,8 +3026,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 */
 	if (accountable_mapping(file, vm_flags)) {
 		charged = len >> PAGE_SHIFT;
+		charged -= nr_accounted;
 		if (security_vm_enough_memory_mm(mm, charged))
-			return -ENOMEM;
+			goto abort_munmap;
+		vms.nr_accounted = 0;
 		vm_flags |= VM_ACCOUNT;
 	}
 
@@ -3052,10 +3078,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * not unmapped, but the maps are removed from the list.
 	 */
 	vma = vm_area_alloc(mm);
-	if (!vma) {
-		error = -ENOMEM;
+	if (!vma)
 		goto unacct_error;
-	}
 
 	vma_iter_config(&vmi, addr, end);
 	vma_set_range(vma, addr, end, pgoff);
@@ -3064,6 +3088,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	if (file) {
 		vma->vm_file = get_file(file);
+		/* call_mmap() may map PTE, so ensure there are no existing PTEs */
+		vms_clear_ptes(&vms, &mas_detach, /* mm_wr_locked = */ true);
+		vms_close_vmas(&vms, &mas_detach);
 		error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
@@ -3154,6 +3181,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 expanded:
 	perf_event_mmap(vma);
 
+	/* Unmap any existing mapping in the area */
+	if (vms.nr_pages)
+		vms_complete_munmap_vmas(&vms, &mas_detach);
+
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
@@ -3201,14 +3232,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 unacct_error:
 	if (charged)
 		vm_unacct_memory(charged);
-	validate_mm(mm);
-	return error;
 
-clear_tree_failed:
-	abort_munmap_vmas(&mas_detach);
-gather_failed:
+abort_munmap:
+	if (vms.nr_pages)
+		abort_munmap_vmas(&mas_detach, vms.closed);
 	validate_mm(mm);
-	return -ENOMEM;
+	return error;
 }
 
 static int __vm_munmap(unsigned long start, size_t len, bool unlock)
@@ -3549,7 +3578,7 @@ void exit_mmap(struct mm_struct *mm)
 	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
-		remove_vma(vma, true);
+		remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
 		count++;
 		cond_resched();
 		vma = vma_next(&vmi);
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 15/21] mm/mmap: Use PHYS_PFN in mmap_region()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (13 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-17 20:07 ` [PATCH v5 16/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of shifting the length by PAGE_SHIFT, use PHYS_PFN.  Also use
the existing pglen local variable everywhere instead of only some of
the time.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
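
A minimal check of the identity the patch relies on (userspace sketch;
a PAGE_SHIFT of 12 is assumed and PHYS_PFN is copied from its
definition in include/linux/pfn.h):

#include <assert.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

int main(void)
{
	unsigned long len = 5UL * 4096 + 123;	/* arbitrary length */

	assert(PHYS_PFN(len) == (len >> PAGE_SHIFT));
	printf("pglen = %lu pages\n", PHYS_PFN(len));
	return 0;
}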

diff --git a/mm/mmap.c b/mm/mmap.c
index 0b7aa2c46cec..45fb8725a6c5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2971,7 +2971,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	struct vm_area_struct *next, *prev, *merge;
-	pgoff_t pglen = len >> PAGE_SHIFT;
+	pgoff_t pglen = PHYS_PFN(len);
 	unsigned long charged = 0;
 	struct vma_munmap_struct vms;
 	struct ma_state mas_detach;
@@ -2991,7 +2991,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * MAP_FIXED may remove pages of mappings that intersects with requested
 	 * mapping. Account for the pages it would unmap.
 	 */
-	if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
 		return -ENOMEM;
 
 	if (unlikely(!can_modify_mm(mm, addr, end)))
@@ -3025,7 +3025,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * Private writable mapping: check memory availability
 	 */
 	if (accountable_mapping(file, vm_flags)) {
-		charged = len >> PAGE_SHIFT;
+		charged = pglen;
 		charged -= nr_accounted;
 		if (security_vm_enough_memory_mm(mm, charged))
 			goto abort_munmap;
@@ -3185,14 +3185,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (vms.nr_pages)
 		vms_complete_munmap_vmas(&vms, &mas_detach);
 
-	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
+	vm_stat_account(mm, vm_flags, pglen);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
 					is_vm_hugetlb_page(vma) ||
 					vma == get_gate_vma(current->mm))
 			vm_flags_clear(vma, VM_LOCKED_MASK);
 		else
-			mm->locked_vm += (len >> PAGE_SHIFT);
+			mm->locked_vm += pglen;
 	}
 
 	if (file)
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 16/21] mm/mmap: Use vms accounted pages in mmap_region()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (14 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 15/21] mm/mmap: Use PHYS_PFN " Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-24 20:23   ` Paul Moore
  2024-07-17 20:07 ` [PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas() Liam R. Howlett
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, linux-security-module, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Change the charged pages calculation to use vms.nr_accounted instead of
the local nr_accounted variable.  This is necessary for a future patch.

This also avoids checking security_vm_enough_memory_mm() if the amount
of memory won't change.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Kees Cook <kees@kernel.org>
Cc: linux-security-module@vger.kernel.org
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
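
A sketch of the resulting accounting (userspace model; the stub below
stands in for security_vm_enough_memory_mm() and the numbers are made
up): only the net growth in accounted pages is charged, and the check
is skipped when the replaced mapping already accounts for everything.

#include <stdio.h>

static int security_vm_enough_memory_stub(unsigned long pages)
{
	(void)pages;
	return 0;	/* stand-in: 0 means the charge is allowed */
}

int main(void)
{
	unsigned long pglen = 256;		/* pages in the new mapping */
	unsigned long nr_accounted = 256;	/* accounted pages being replaced */
	unsigned long charged = pglen - nr_accounted;

	if (charged && security_vm_enough_memory_stub(charged))
		return 1;			/* would goto abort_munmap */

	printf("charged %lu page(s)\n", charged);
	return 0;
}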

diff --git a/mm/mmap.c b/mm/mmap.c
index 45fb8725a6c5..9f870e715a47 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3026,9 +3026,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 */
 	if (accountable_mapping(file, vm_flags)) {
 		charged = pglen;
-		charged -= nr_accounted;
-		if (security_vm_enough_memory_mm(mm, charged))
+		charged -= vms.nr_accounted;
+		if (charged && security_vm_enough_memory_mm(mm, charged))
 			goto abort_munmap;
+
 		vms.nr_accounted = 0;
 		vm_flags |= VM_ACCOUNT;
 	}
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (15 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 16/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-22 14:25   ` Lorenzo Stoakes
  2024-07-17 20:07 ` [PATCH v5 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, LEROY Christophe, linuxppc-dev, Dave Hansen

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The arch_unmap call was previously moved above the rbtree modifications
in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
corruption").  The move was motivated by an issue with calling
arch_unmap() after the rbtree was modified.

Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
arch_unmap() prior to modifying the vma tree no longer exists
(regardless of rbtree or maple tree implementations).

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: LEROY Christophe <christophe.leroy2@cs-soprasteria.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Dave Hansen <dave.hansen@intel.com>
---
 mm/mmap.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 9f870e715a47..117e8240f697 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2680,6 +2680,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	mm = vms->mm;
 	mm->map_count -= vms->vma_count;
 	mm->locked_vm -= vms->locked_vm;
+	arch_unmap(mm, vms->start, vms->end); /* write lock needed */
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
@@ -2907,7 +2908,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
  *
  * This function takes a @mas that is either pointing to the previous VMA or set
  * to MA_START and sets it up to remove the mapping(s).  The @len will be
- * aligned and any arch_unmap work will be preformed.
+ * aligned prior to munmap.
  *
  * Return: 0 on success and drops the lock if so directed, error and leaves the
  * lock held otherwise.
@@ -2927,16 +2928,12 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 		return -EINVAL;
 
 	/*
-	 * Check if memory is sealed before arch_unmap.
 	 * Prevent unmapping a sealed VMA.
 	 * can_modify_mm assumes we have acquired the lock on MM.
 	 */
 	if (unlikely(!can_modify_mm(mm, start, end)))
 		return -EPERM;
 
-	 /* arch_unmap() might do unmaps itself.  */
-	arch_unmap(mm, start, end);
-
 	/* Find the first overlapping VMA */
 	vma = vma_find(vmi, end);
 	if (!vma) {
@@ -2997,9 +2994,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (unlikely(!can_modify_mm(mm, addr, end)))
 		return -EPERM;
 
-	 /* arch_unmap() might do unmaps itself.  */
-	arch_unmap(mm, addr, end);
-
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
 	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
@@ -3377,14 +3371,12 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 
 	/*
-	 * Check if memory is sealed before arch_unmap.
 	 * Prevent unmapping a sealed VMA.
 	 * can_modify_mm assumes we have acquired the lock on MM.
 	 */
 	if (unlikely(!can_modify_mm(mm, start, end)))
 		return -EPERM;
 
-	arch_unmap(mm, start, end);
 	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
 }
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 18/21] mm/mmap: Move can_modify_mm() check down the stack
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (16 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas() Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-22 14:41   ` Lorenzo Stoakes
  2024-07-24 16:29   ` Jeff Xu
  2024-07-17 20:07 ` [PATCH v5 19/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
                   ` (2 subsequent siblings)
  20 siblings, 2 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Without an arch_unmap() call so high in the call stack, the check for
mseal'ed vmas can be moved lower as well.  This has the benefit of only
actually checking if things are msealed when there is anything to check.
That is, we know there is at least one vma that is in the way and needs
to be checked.

Only call the can_modify_mm() in do_vmi_align_munmap() and the MAP_FIXED
case of mmap_region().

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Jeff Xu <jeffxu@chromium.org>
---
 mm/mmap.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)
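
A userspace illustration of the EPERM behaviour that can_modify_mm()
implements (assumes a kernel with mseal(2), v6.10 or later, and headers
that define __NR_mseal; otherwise the program just reports that and
exits):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
#ifdef __NR_mseal
	long page = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, page, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	if (syscall(__NR_mseal, p, (size_t)page, 0) != 0) {
		perror("mseal");	/* kernel too old, nothing to show */
		return 0;
	}
	/* Both munmap() and a MAP_FIXED overwrite are refused. */
	if (munmap(p, page) != 0)
		printf("munmap of sealed vma: %s\n", strerror(errno));
	if (mmap(p, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
		printf("MAP_FIXED over sealed vma: %s\n", strerror(errno));
#else
	printf("__NR_mseal not defined in these headers\n");
#endif
	return 0;
}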

diff --git a/mm/mmap.c b/mm/mmap.c
index 117e8240f697..a32f545d3987 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2877,6 +2877,10 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	struct vma_munmap_struct vms;
 	int error;
 
+	/* Prevent unmapping a sealed VMA. */
+	if (unlikely(!can_modify_mm(mm, start, end)))
+		return -EPERM;
+
 	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
 	error = vms_gather_munmap_vmas(&vms, &mas_detach);
 	if (error)
@@ -2927,13 +2931,6 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 	if (end == start)
 		return -EINVAL;
 
-	/*
-	 * Prevent unmapping a sealed VMA.
-	 * can_modify_mm assumes we have acquired the lock on MM.
-	 */
-	if (unlikely(!can_modify_mm(mm, start, end)))
-		return -EPERM;
-
 	/* Find the first overlapping VMA */
 	vma = vma_find(vmi, end);
 	if (!vma) {
@@ -2991,13 +2988,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
 		return -ENOMEM;
 
-	if (unlikely(!can_modify_mm(mm, addr, end)))
-		return -EPERM;
 
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
 	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
 	if (vma) {
+		/* Prevent unmapping a sealed VMA. */
+		if (unlikely(!can_modify_mm(mm, addr, end)))
+			return -EPERM;
+
 		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
 		mt_on_stack(mt_detach);
 		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
@@ -3370,13 +3369,6 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 
-	/*
-	 * Prevent unmapping a sealed VMA.
-	 * can_modify_mm assumes we have acquired the lock on MM.
-	 */
-	if (unlikely(!can_modify_mm(mm, start, end)))
-		return -EPERM;
-
 	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
 }
 
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 19/21] ipc/shm, mm: Drop do_vma_munmap()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (17 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-22 14:58   ` Lorenzo Stoakes
  2024-07-17 20:07 ` [PATCH v5 20/21] mm/mmap: Move may_expand_vm() check in mmap_region() Liam R. Howlett
  2024-07-17 20:07 ` [PATCH v5 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
  20 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The do_vma_munmap() wrapper existed for callers that didn't have a vma
iterator and needed to check the vma mseal status prior to calling the
underlying munmap().  All callers now use a vma iterator and since the
mseal check of can_modify_mm() has been moved to do_vmi_align_munmap()
and the vmas are aligned, this function can just be called instead.

do_vmi_align_munmap() can no longer be static as ipc/shm is using it and
it is exported via the mm.h header.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h |  6 +++---
 ipc/shm.c          |  8 ++++----
 mm/mmap.c          | 33 +++++----------------------------
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5f1075d19600..49a24c023153 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3406,14 +3406,14 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
 extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 			 unsigned long start, size_t len, struct list_head *uf,
 			 bool unlock);
+extern int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		    struct mm_struct *mm, unsigned long start,
+		    unsigned long end, struct list_head *uf, bool unlock);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
 
 #ifdef CONFIG_MMU
-extern int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
-			 unsigned long start, unsigned long end,
-			 struct list_head *uf, bool unlock);
 extern int __mm_populate(unsigned long addr, unsigned long len,
 			 int ignore_errors);
 static inline void mm_populate(unsigned long addr, unsigned long len)
diff --git a/ipc/shm.c b/ipc/shm.c
index 3e3071252dac..99564c870084 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1778,8 +1778,8 @@ long ksys_shmdt(char __user *shmaddr)
 			 */
 			file = vma->vm_file;
 			size = i_size_read(file_inode(vma->vm_file));
-			do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
-				      NULL, false);
+			do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
+					    vma->vm_end, NULL, false);
 			/*
 			 * We discovered the size of the shm segment, so
 			 * break out of here and fall through to the next
@@ -1803,8 +1803,8 @@ long ksys_shmdt(char __user *shmaddr)
 		if ((vma->vm_ops == &shm_vm_ops) &&
 		    ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
 		    (vma->vm_file == file)) {
-			do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
-				      NULL, false);
+			do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
+					    vma->vm_end, NULL, false);
 		}
 
 		vma = vma_next(&vmi);
diff --git a/mm/mmap.c b/mm/mmap.c
index a32f545d3987..ca752317adef 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -269,11 +269,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 			goto out; /* mapping intersects with an existing non-brk vma. */
 		/*
 		 * mm->brk must be protected by write mmap_lock.
-		 * do_vma_munmap() will drop the lock on success,  so update it
-		 * before calling do_vma_munmap().
+		 * do_vmi_align_munmap() will drop the lock on success,  so
+		 * update it before calling do_vmi_align_munmap().
 		 */
 		mm->brk = brk;
-		if (do_vma_munmap(&vmi, brkvma, newbrk, oldbrk, &uf, true))
+		if (do_vmi_align_munmap(&vmi, brkvma, mm, newbrk, oldbrk, &uf,
+					/* unlock = */ true))
 			goto out;
 
 		goto success_unlocked;
@@ -2865,7 +2866,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
  * Return: 0 on success and drops the lock if so directed, error and leaves the
  * lock held otherwise.
  */
-static int
+int
 do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
 		    unsigned long end, struct list_head *uf, bool unlock)
@@ -3348,30 +3349,6 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	return ret;
 }
 
-/*
- * do_vma_munmap() - Unmap a full or partial vma.
- * @vmi: The vma iterator pointing at the vma
- * @vma: The first vma to be munmapped
- * @start: the start of the address to unmap
- * @end: The end of the address to unmap
- * @uf: The userfaultfd list_head
- * @unlock: Drop the lock on success
- *
- * unmaps a VMA mapping when the vma iterator is already in position.
- * Does not handle alignment.
- *
- * Return: 0 on success drops the lock of so directed, error on failure and will
- * still hold the lock.
- */
-int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		unsigned long start, unsigned long end, struct list_head *uf,
-		bool unlock)
-{
-	struct mm_struct *mm = vma->vm_mm;
-
-	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
-}
-
 /*
  * do_brk_flags() - Increase the brk vma if the flags match.
  * @vmi: The vma iterator
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 20/21] mm/mmap: Move may_expand_vm() check in mmap_region()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (18 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 19/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  2024-07-17 20:07 ` [PATCH v5 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The may_expand_vm() check requires the count of the pages within the
munmap range.  Since this count is needed for accounting and is
obtained later anyway, reordering may_expand_vm() to later in the call
stack, after the vma munmap struct (vms) is initialised and the gather
stage is potentially run, allows for a single loop over the vmas.  The
gather stage does not commit any work, so everything can be undone in
the case of a failure.

The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
call, so use it instead of looping over the vmas twice.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)
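
The resulting limit check, as a userspace sketch (the stub stands in
for may_expand_vm() and the numbers are arbitrary): the pages the
MAP_FIXED replacement will remove are credited against the new mapping
before comparing with the address space limit.

#include <stdio.h>

static int may_expand_stub(unsigned long total_vm, unsigned long limit,
			   unsigned long npages)
{
	/* stand-in for may_expand_vm(): non-zero when the growth fits */
	return total_vm + npages <= limit;
}

int main(void)
{
	unsigned long pglen = 512;		/* pages in the new mapping */
	unsigned long vms_nr_pages = 384;	/* pages the replacement unmaps */
	unsigned long total_vm = 10000, limit = 10200;

	if (!may_expand_stub(total_vm, limit, pglen - vms_nr_pages)) {
		printf("-ENOMEM\n");
		return 1;
	}
	printf("net growth of %lu pages fits\n", pglen - vms_nr_pages);
	return 0;
}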

diff --git a/mm/mmap.c b/mm/mmap.c
index ca752317adef..5e74f5cb7be4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -402,27 +402,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
 		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
 }
 
-static unsigned long count_vma_pages_range(struct mm_struct *mm,
-		unsigned long addr, unsigned long end,
-		unsigned long *nr_accounted)
-{
-	VMA_ITERATOR(vmi, mm, addr);
-	struct vm_area_struct *vma;
-	unsigned long nr_pages = 0;
-
-	*nr_accounted = 0;
-	for_each_vma_range(vmi, vma, end) {
-		unsigned long vm_start = max(addr, vma->vm_start);
-		unsigned long vm_end = min(end, vma->vm_end);
-
-		nr_pages += PHYS_PFN(vm_end - vm_start);
-		if (vma->vm_flags & VM_ACCOUNT)
-			*nr_accounted += PHYS_PFN(vm_end - vm_start);
-	}
-
-	return nr_pages;
-}
-
 static void __vma_link_file(struct vm_area_struct *vma,
 			    struct address_space *mapping)
 {
@@ -2977,17 +2956,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	pgoff_t vm_pgoff;
 	int error = -ENOMEM;
 	VMA_ITERATOR(vmi, mm, addr);
-	unsigned long nr_pages, nr_accounted;
-
-	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
-
-	/*
-	 * Check against address space limit.
-	 * MAP_FIXED may remove pages of mappings that intersects with requested
-	 * mapping. Account for the pages it would unmap.
-	 */
-	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
-		return -ENOMEM;
 
 
 	/* Find the first overlapping VMA */
@@ -3015,6 +2983,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			vma_iter_next_range(&vmi);
 	}
 
+	/* Check against address space limit. */
+	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
+		goto abort_munmap;
+
 	/*
 	 * Private writable mapping: check memory availability
 	 */
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v5 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas()
  2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (19 preceding siblings ...)
  2024-07-17 20:07 ` [PATCH v5 20/21] mm/mmap: Move may_expand_vm() check in mmap_region() Liam R. Howlett
@ 2024-07-17 20:07 ` Liam R. Howlett
  20 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-17 20:07 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, Jeff Xu,
	Liam R. Howlett, Lorenzo Stoakes

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The comment has been outdated since 6b73cff239e52 ("mm: change munmap
splitting order and move_vma()"); move_vma() has since been altered to
fix the fragile accounting state.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/mmap.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 5e74f5cb7be4..fa726116346d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2703,13 +2703,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 
 	/*
 	 * If we need to split any vma, do it now to save pain later.
-	 *
-	 * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
-	 * unmapped vm_area_struct will remain in use: so lower split_vma
-	 * places tmp vma above, and higher split_vma places tmp vma below.
+	 * Does it split the first one?
 	 */
-
-	/* Does it split the first one? */
 	if (vms->start > vms->vma->vm_start) {
 
 		/*
-- 
2.43.0



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v5 10/21] mm/mmap: Support vma == NULL in init_vma_munmap()
  2024-07-17 20:06 ` [PATCH v5 10/21] mm/mmap: Support vma == NULL in init_vma_munmap() Liam R. Howlett
@ 2024-07-22 13:14   ` Lorenzo Stoakes
  0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2024-07-22 13:14 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook, Jeff Xu

On Wed, Jul 17, 2024 at 04:06:58PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Adding support for a NULL vma means the init_vma_munmap() can be
> initialized for a less error-prone process when calling
> vms_complete_munmap_vmas() later on.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
>  mm/mmap.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a1544a68558e..722bcced0499 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -517,9 +517,14 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
>  {
>  	vms->vmi = vmi;
>  	vms->vma = vma;
> -	vms->mm = vma->vm_mm;
> -	vms->start = start;
> -	vms->end = end;
> +	if (vma) {
> +		vms->mm = vma->vm_mm;
> +		vms->start = start;
> +		vms->end = end;
> +	} else {
> +		vms->mm = NULL;
> +		vms->start = vms->end = 0;
> +	}
>  	vms->unlock = unlock;
>  	vms->uf = uf;
>  	vms->vma_count = 0;
> @@ -2950,11 +2955,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
>  	/* Find the first overlapping VMA */
>  	vma = vma_find(&vmi, end);
> +	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
>  	if (vma) {
>  		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
>  		mt_on_stack(mt_detach);
>  		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> -		init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
>  		/* Prepare to unmap any existing mapping in the area */
>  		if (vms_gather_munmap_vmas(&vms, &mas_detach))
>  			return -ENOMEM;
> --
> 2.43.0
>

LGTM,

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas()
  2024-07-17 20:07 ` [PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas() Liam R. Howlett
@ 2024-07-22 14:25   ` Lorenzo Stoakes
  2024-07-23 14:11     ` Liam R. Howlett
  0 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2024-07-22 14:25 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook, Jeff Xu, LEROY Christophe, linuxppc-dev, Dave Hansen

On Wed, Jul 17, 2024 at 04:07:05PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The arch_unmap call was previously moved above the rbtree modifications
> in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
> corruption").  The move was motivated by an issue with calling
> arch_unmap() after the rbtree was modified.
>
> Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
> ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
> arch_unmap() prior to modifying the vma tree no longer exists
> (regardless of rbtree or maple tree implementations).
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Cc: LEROY Christophe <christophe.leroy2@cs-soprasteria.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Dave Hansen <dave.hansen@intel.com>
> ---
>  mm/mmap.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 9f870e715a47..117e8240f697 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2680,6 +2680,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
>  	mm = vms->mm;
>  	mm->map_count -= vms->vma_count;
>  	mm->locked_vm -= vms->locked_vm;
> +	arch_unmap(mm, vms->start, vms->end); /* write lock needed */

Worth having a mmap_assert_write_locked() here? Would make this
self-documenting also.
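
Something like this (just a sketch, not in the patch):

	mmap_assert_write_locked(mm);
	arch_unmap(mm, vms->start, vms->end);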

>  	if (vms->unlock)
>  		mmap_write_downgrade(mm);
>
> @@ -2907,7 +2908,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>   *
>   * This function takes a @mas that is either pointing to the previous VMA or set
>   * to MA_START and sets it up to remove the mapping(s).  The @len will be
> - * aligned and any arch_unmap work will be preformed.
> + * aligned prior to munmap.
>   *
>   * Return: 0 on success and drops the lock if so directed, error and leaves the
>   * lock held otherwise.
> @@ -2927,16 +2928,12 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  		return -EINVAL;
>
>  	/*
> -	 * Check if memory is sealed before arch_unmap.
>  	 * Prevent unmapping a sealed VMA.
>  	 * can_modify_mm assumes we have acquired the lock on MM.
>  	 */
>  	if (unlikely(!can_modify_mm(mm, start, end)))
>  		return -EPERM;
>
> -	 /* arch_unmap() might do unmaps itself.  */
> -	arch_unmap(mm, start, end);
> -
>  	/* Find the first overlapping VMA */
>  	vma = vma_find(vmi, end);
>  	if (!vma) {
> @@ -2997,9 +2994,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	if (unlikely(!can_modify_mm(mm, addr, end)))
>  		return -EPERM;
>
> -	 /* arch_unmap() might do unmaps itself.  */
> -	arch_unmap(mm, addr, end);
> -

It seems to me that the intent of this particular invocation was to ensure
we have done what we can to unmap before trying to unmap ourselves.

However this seems stupid to me anyway - 'hey maybe the arch will do this
for us' - yeah probably not.

So this should definitely go regardless, given we will invoke it later now
anyway.

>  	/* Find the first overlapping VMA */
>  	vma = vma_find(&vmi, end);
>  	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> @@ -3377,14 +3371,12 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	struct mm_struct *mm = vma->vm_mm;
>
>  	/*
> -	 * Check if memory is sealed before arch_unmap.
>  	 * Prevent unmapping a sealed VMA.
>  	 * can_modify_mm assumes we have acquired the lock on MM.
>  	 */
>  	if (unlikely(!can_modify_mm(mm, start, end)))
>  		return -EPERM;
>
> -	arch_unmap(mm, start, end);
>  	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
>  }
>
> --
> 2.43.0
>

I hope we can find a way to eliminate these kind of hooks altogether as
they reduce our control over how VMA operations are performed.

LGTM,

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v5 18/21] mm/mmap: Move can_modify_mm() check down the stack
  2024-07-17 20:07 ` [PATCH v5 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
@ 2024-07-22 14:41   ` Lorenzo Stoakes
  2024-07-24 16:29   ` Jeff Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2024-07-22 14:41 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook, Jeff Xu

On Wed, Jul 17, 2024 at 04:07:06PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Without an arch_unmap() call so high in the call stack, the check for
> mseal'ed vmas can be moved lower as well.  This has the benefit of only
> actually checking if things are msealed when there is anything to check.
> That is, we know there is at least one vma that is in the way and needs
> to be checked.
>
> Only call the can_modify_mm() in do_vmi_align_munmap() and the MAP_FIXED
> case of mmap_region().
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Cc: Jeff Xu <jeffxu@chromium.org>
> ---
>  mm/mmap.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 117e8240f697..a32f545d3987 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2877,6 +2877,10 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	struct vma_munmap_struct vms;
>  	int error;
>
> +	/* Prevent unmapping a sealed VMA. */
> +	if (unlikely(!can_modify_mm(mm, start, end)))
> +		return -EPERM;
> +
>  	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
>  	error = vms_gather_munmap_vmas(&vms, &mas_detach);
>  	if (error)
> @@ -2927,13 +2931,6 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  	if (end == start)
>  		return -EINVAL;
>
> -	/*
> -	 * Prevent unmapping a sealed VMA.
> -	 * can_modify_mm assumes we have acquired the lock on MM.
> -	 */
> -	if (unlikely(!can_modify_mm(mm, start, end)))
> -		return -EPERM;
> -
>  	/* Find the first overlapping VMA */
>  	vma = vma_find(vmi, end);
>  	if (!vma) {
> @@ -2991,13 +2988,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
>  		return -ENOMEM;
>
> -	if (unlikely(!can_modify_mm(mm, addr, end)))
> -		return -EPERM;
>
>  	/* Find the first overlapping VMA */
>  	vma = vma_find(&vmi, end);
>  	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
>  	if (vma) {
> +		/* Prevent unmapping a sealed VMA. */
> +		if (unlikely(!can_modify_mm(mm, addr, end)))
> +			return -EPERM;
> +
>  		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
>  		mt_on_stack(mt_detach);
>  		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> @@ -3370,13 +3369,6 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>
> -	/*
> -	 * Prevent unmapping a sealed VMA.
> -	 * can_modify_mm assumes we have acquired the lock on MM.
> -	 */
> -	if (unlikely(!can_modify_mm(mm, start, end)))
> -		return -EPERM;
> -
>  	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
>  }
>
> --
> 2.43.0
>

LGTM,

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v5 19/21] ipc/shm, mm: Drop do_vma_munmap()
  2024-07-17 20:07 ` [PATCH v5 19/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
@ 2024-07-22 14:58   ` Lorenzo Stoakes
  0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2024-07-22 14:58 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook, Jeff Xu

On Wed, Jul 17, 2024 at 04:07:07PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The do_vma_munmap() wrapper existed for callers that didn't have a vma
> iterator and needed to check the vma mseal status prior to calling the
> underlying munmap().  All callers now use a vma iterator and since the
> mseal check of can_modify_mm() has been moved to do_vmi_align_munmap()
> and the vmas are aligned, this function can just be called instead.
>
> do_vmi_align_munmap() can no longer be static as ipc/shm is using it and
> it is exported via the mm.h header.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
>  include/linux/mm.h |  6 +++---
>  ipc/shm.c          |  8 ++++----
>  mm/mmap.c          | 33 +++++----------------------------
>  3 files changed, 12 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5f1075d19600..49a24c023153 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3406,14 +3406,14 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
>  extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>  			 unsigned long start, size_t len, struct list_head *uf,
>  			 bool unlock);
> +extern int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> +		    struct mm_struct *mm, unsigned long start,
> +		    unsigned long end, struct list_head *uf, bool unlock);

Going to give the same nit as Vlasta gave to me ;) which is that I believe
there's an unwritten rule that we drop the superfluous extern as we go
here.

Obviously this is not very important! :)
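
i.e. (sketch of the nit only):

int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
		    struct mm_struct *mm, unsigned long start,
		    unsigned long end, struct list_head *uf, bool unlock);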

>  extern int do_munmap(struct mm_struct *, unsigned long, size_t,
>  		     struct list_head *uf);
>  extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
>
>  #ifdef CONFIG_MMU
> -extern int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> -			 unsigned long start, unsigned long end,
> -			 struct list_head *uf, bool unlock);
>  extern int __mm_populate(unsigned long addr, unsigned long len,
>  			 int ignore_errors);
>  static inline void mm_populate(unsigned long addr, unsigned long len)
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 3e3071252dac..99564c870084 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -1778,8 +1778,8 @@ long ksys_shmdt(char __user *shmaddr)
>  			 */
>  			file = vma->vm_file;
>  			size = i_size_read(file_inode(vma->vm_file));
> -			do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
> -				      NULL, false);
> +			do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
> +					    vma->vm_end, NULL, false);
>  			/*
>  			 * We discovered the size of the shm segment, so
>  			 * break out of here and fall through to the next
> @@ -1803,8 +1803,8 @@ long ksys_shmdt(char __user *shmaddr)
>  		if ((vma->vm_ops == &shm_vm_ops) &&
>  		    ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
>  		    (vma->vm_file == file)) {
> -			do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
> -				      NULL, false);
> +			do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
> +					    vma->vm_end, NULL, false);
>  		}
>
>  		vma = vma_next(&vmi);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index a32f545d3987..ca752317adef 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -269,11 +269,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>  			goto out; /* mapping intersects with an existing non-brk vma. */
>  		/*
>  		 * mm->brk must be protected by write mmap_lock.
> -		 * do_vma_munmap() will drop the lock on success,  so update it
> -		 * before calling do_vma_munmap().
> +		 * do_vmi_align_munmap() will drop the lock on success,  so
> +		 * update it before calling do_vma_munmap().
>  		 */
>  		mm->brk = brk;
> -		if (do_vma_munmap(&vmi, brkvma, newbrk, oldbrk, &uf, true))
> +		if (do_vmi_align_munmap(&vmi, brkvma, mm, newbrk, oldbrk, &uf,
> +					/* unlock = */ true))
>  			goto out;
>
>  		goto success_unlocked;
> @@ -2865,7 +2866,7 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>   * Return: 0 on success and drops the lock if so directed, error and leaves the
>   * lock held otherwise.
>   */
> -static int
> +int
>  do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		    struct mm_struct *mm, unsigned long start,
>  		    unsigned long end, struct list_head *uf, bool unlock)
> @@ -3348,30 +3349,6 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
>  	return ret;
>  }
>
> -/*
> - * do_vma_munmap() - Unmap a full or partial vma.
> - * @vmi: The vma iterator pointing at the vma
> - * @vma: The first vma to be munmapped
> - * @start: the start of the address to unmap
> - * @end: The end of the address to unmap
> - * @uf: The userfaultfd list_head
> - * @unlock: Drop the lock on success
> - *
> - * unmaps a VMA mapping when the vma iterator is already in position.
> - * Does not handle alignment.
> - *
> - * Return: 0 on success drops the lock of so directed, error on failure and will
> - * still hold the lock.
> - */
> -int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> -		unsigned long start, unsigned long end, struct list_head *uf,
> -		bool unlock)
> -{
> -	struct mm_struct *mm = vma->vm_mm;
> -
> -	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
> -}
> -
>  /*
>   * do_brk_flags() - Increase the brk vma if the flags match.
>   * @vmi: The vma iterator
> --
> 2.43.0
>

This is a nice cleanup. LGTM,

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v5 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-07-17 20:07 ` [PATCH v5 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-07-22 18:42   ` Lorenzo Stoakes
  2024-07-23 14:15     ` Liam R. Howlett
  0 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2024-07-22 18:42 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook, Jeff Xu

On Wed, Jul 17, 2024 at 04:07:02PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Instead of zeroing the vma tree and then overwriting the area, let the
> area be overwritten and then clean up the gathered vmas using
> vms_complete_munmap_vmas().
>
> If a driver is mapping over an existing vma, then clear the ptes before
> the call_mmap() invocation.  If the vma has a vm_ops->close(), then call
> the close() function.  This is done using the vms_clear_ptes() and
> vms_close_vmas() helpers.  This has the side effect of needing to call
> open() on the vmas if the mmap_region() fails later on.
>
> Temporarily keep track of the number of pages that will be removed and
> reduce the charged amount.
>
> This commit drops the validate_mm() call in the vma_expand() function.
> It is necessary to drop the validate as it would fail since the mm
> map_count would be incorrect during a vma expansion, prior to the
> cleanup from vms_complete_munmap_vmas().
>
> Clean up the error handling of vms_gather_munmap_vmas() by calling
> the verification within the function.
>
> Note that before this change, a MAP_FIXED could fail and leave a gap in
> the vma tree.  With this change, a failed MAP_FIXED will not create a
> gap; it will leave *a* vma in the area (which may have been split) and
> attempt to restore them to an operational state (re-attached and
> vm_ops->open()'ed if they were vm_ops->close()'d).
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
>  mm/internal.h |   2 +
>  mm/mmap.c     | 119 +++++++++++++++++++++++++++++++-------------------
>  2 files changed, 76 insertions(+), 45 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index ec8441362c28..5bd60cb9fcbb 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1503,6 +1503,8 @@ struct vma_munmap_struct {
>  	unsigned long stack_vm;
>  	unsigned long data_vm;
>  	bool unlock;			/* Unlock after the munmap */
> +	bool clear_ptes;		/* If there are outstanding PTE to be cleared */
> +	bool closed;			/* vma->vm_ops->close() called already */
>  };
>
>  void __meminit __init_single_page(struct page *page, unsigned long pfn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 20da0d039c95..0b7aa2c46cec 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -170,10 +170,11 @@ void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
>  /*
>   * Close a vm structure and free it.
>   */
> -static void remove_vma(struct vm_area_struct *vma, bool unreachable)
> +static
> +void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
>  {
>  	might_sleep();
> -	if (vma->vm_ops && vma->vm_ops->close)
> +	if (!closed && vma->vm_ops && vma->vm_ops->close)
>  		vma->vm_ops->close(vma);
>  	if (vma->vm_file)
>  		fput(vma->vm_file);
> @@ -401,17 +402,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
>  }
>
>  static unsigned long count_vma_pages_range(struct mm_struct *mm,
> -		unsigned long addr, unsigned long end)
> +		unsigned long addr, unsigned long end,
> +		unsigned long *nr_accounted)
>  {
>  	VMA_ITERATOR(vmi, mm, addr);
>  	struct vm_area_struct *vma;
>  	unsigned long nr_pages = 0;
>
> +	*nr_accounted = 0;
>  	for_each_vma_range(vmi, vma, end) {
>  		unsigned long vm_start = max(addr, vma->vm_start);
>  		unsigned long vm_end = min(end, vma->vm_end);
>
>  		nr_pages += PHYS_PFN(vm_end - vm_start);
> +		if (vma->vm_flags & VM_ACCOUNT)
> +			*nr_accounted += PHYS_PFN(vm_end - vm_start);
>  	}
>
>  	return nr_pages;
> @@ -527,6 +532,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
>  	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
>  	vms->unmap_start = FIRST_USER_ADDRESS;
>  	vms->unmap_end = USER_PGTABLES_CEILING;
> +	vms->clear_ptes = false;	/* No PTEs to clear yet */
> +	vms->closed = false;
>  }
>
>  /*
> @@ -735,7 +742,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	vma_iter_store(vmi, vma);
>
>  	vma_complete(&vp, vmi, vma->vm_mm);
> -	validate_mm(vma->vm_mm);
>  	return 0;
>
>  nomem:
> @@ -2597,23 +2603,31 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
>   *
>   * Reattach any detached vmas and free up the maple tree used to track the vmas.
>   */
> -static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> +static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
>  {
>  	struct vm_area_struct *vma;
>
>  	mas_set(mas_detach, 0);
> -	mas_for_each(mas_detach, vma, ULONG_MAX)
> +	mas_for_each(mas_detach, vma, ULONG_MAX) {
> +		if (closed && vma->vm_ops && vma->vm_ops->close &&
> +		    vma->vm_ops->open)
> +			vma->vm_ops->open(vma);
> +

Unfortunately I think this is broken. While in theory this should probably
be semantically correct, I think drivers get this wrong.

For instance, in the devio driver, usbdev_vm_ops assigns custom VMA
open/close functionality in usbdev_vm_open() and usbdev_vm_close().

usbdev_vm_open() simply increments a 'vma_use_count' counter, whereas
usbdev_vm_close() calls dec_usb_memory_use_count(), which, if the count
reaches zero, frees a bunch of objects.

I've not tested it directly, but it's conceivable that we could end up with
an entirely broken mapping that might result in a kernel NULL pointer deref
or some other hideous, possibly exploitable (at least for DoS) scenario.
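
To illustrate the pattern (a simplified sketch only - this is *not* the
actual devio code, the struct and function names below are made up):

	struct example_usb_mem {
		int vma_use_count;
		void *buffer;		/* freed once the count hits zero */
	};

	static void example_vm_open(struct vm_area_struct *vma)
	{
		struct example_usb_mem *mem = vma->vm_private_data;

		/* Assumes the backing object is still around. */
		mem->vma_use_count++;
	}

	static void example_vm_close(struct vm_area_struct *vma)
	{
		struct example_usb_mem *mem = vma->vm_private_data;

		/* The final close tears the backing object down. */
		if (--mem->vma_use_count == 0) {
			kfree(mem->buffer);
			kfree(mem);
		}
	}

If the abort path re-invokes ->open() on a vma whose ->close() already
dropped the last reference, the open() above bumps a counter inside freed
memory, so we would be relying on every driver's open() being safe to call
on torn-down state - which they have no reason to guarantee.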

Also since this is up to drivers, we can't really control whether people do
stupid things here or otherwise assume this close/reopen scenario cannot
happen.

I think the fact we _might_ cause inconsistent kernel state here rules this
out as an approach unfortunately.

We can't simply do what this code did before (that is, leaving a hole) as
this might require allocations to clear a range in the maple tree (as you
pointed out in the open VMA scalability call earlier today).

However, as I suggested in the call, it seems that the case of us
performing a MAP_FIXED mapping _and_ removing underlying VMAs _and_ those
VMAs having custom close() handlers is very niche, so in this instance it
seems sensible to simply preallocate memory to give ourselves a way out,
clearing the range and returning an error only when this scenario occurs.

Since this is such an edge case it'll mean we almost never preallocate,
only doing so in this rare instance.

It's ugly, but it seems there is no really 'pretty' solution to this
problem, and we don't want this to block this series!
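
Very roughly, something like the below against this patch (an untested
sketch: vms_has_close() is a made-up helper, and the three-argument
mas_preallocate() signature is my assumption):

	/* Hypothetical: does any vma we are about to replace have ->close()? */
	static bool vms_has_close(struct ma_state *mas_detach)
	{
		struct vm_area_struct *vma;

		mas_set(mas_detach, 0);
		mas_for_each(mas_detach, vma, ULONG_MAX)
			if (vma->vm_ops && vma->vm_ops->close)
				return true;

		return false;
	}

and then in mmap_region(), right after gathering the vmas:

		if (vms_gather_munmap_vmas(&vms, &mas_detach))
			return -ENOMEM;

		/*
		 * Rare case: MAP_FIXED over vmas with a custom ->close().
		 * Preallocate now so that, on failure, the error path can
		 * zero the range in the maple tree without allocating,
		 * rather than trying to re-open() the old vmas.
		 */
		if (vms_has_close(&mas_detach)) {
			mas_set_range(&vmi.mas, addr, end - 1);
			if (mas_preallocate(&vmi.mas, NULL, GFP_KERNEL))
				goto abort_munmap;
		}

The error path for this case would then store the NULL with the
preallocated nodes (i.e. accept the hole) instead of calling
abort_munmap_vmas() with closed == true.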

>  		vma_mark_detached(vma, false);
> +	}
>
>  	__mt_destroy(mas_detach->tree);
>  }
>
>
> -static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> +static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
>  		struct ma_state *mas_detach, bool mm_wr_locked)
>  {
>  	struct mmu_gather tlb;
>
> +	if (!vms->clear_ptes)
> +		return;
> +
>  	/*
>  	 * We can free page tables without write-locking mmap_lock because VMAs
>  	 * were isolated before we downgraded mmap_lock.
> @@ -2629,6 +2643,23 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
>  	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
>  		      vms->unmap_end, mm_wr_locked);
>  	tlb_finish_mmu(&tlb);
> +	vms->clear_ptes = false;
> +}
> +
> +static inline void
> +vms_close_vmas(struct vma_munmap_struct *vms, struct ma_state *mas_detach)
> +{
> +	struct vm_area_struct *vma;
> +
> +	if (!vms->vma_count)
> +		return;
> +
> +	mas_set(mas_detach, 0);
> +	mas_for_each(mas_detach, vma, ULONG_MAX)
> +		if (vma->vm_ops && vma->vm_ops->close)
> +			vma->vm_ops->close(vma);
> +
> +	vms->closed = true;
>  }
>
>  /*
> @@ -2652,7 +2683,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
>  	if (vms->unlock)
>  		mmap_write_downgrade(mm);
>
> -	vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> +	vms_clear_ptes(vms, mas_detach, !vms->unlock);
>  	/* Update high watermark before we lower total_vm */
>  	update_hiwater_vm(mm);
>  	/* Stat accounting */
> @@ -2663,7 +2694,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
>  	/* Remove and clean up vmas */
>  	mas_set(mas_detach, 0);
>  	mas_for_each(mas_detach, vma, ULONG_MAX)
> -		remove_vma(vma, false);
> +		remove_vma(vma, /* unreachable = */ false, vms->closed);
>
>  	vm_unacct_memory(vms->nr_accounted);
>  	validate_mm(mm);
> @@ -2804,14 +2835,18 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
>  	while (vma_iter_addr(vms->vmi) > vms->start)
>  		vma_iter_prev_range(vms->vmi);
>
> +	/* There are now PTEs that need to be cleared */
> +	vms->clear_ptes = true;
> +
>  	return 0;
>
>  userfaultfd_error:
>  munmap_gather_failed:
>  end_split_failed:
> -	abort_munmap_vmas(mas_detach);
> +	abort_munmap_vmas(mas_detach, /* closed = */ false);
>  start_split_failed:
>  map_count_exceeded:
> +	validate_mm(vms->mm);
>  	return error;
>  }
>
> @@ -2855,9 +2890,9 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	return 0;
>
>  clear_tree_failed:
> -	abort_munmap_vmas(&mas_detach);
> -gather_failed:
> +	abort_munmap_vmas(&mas_detach, /* closed = */ false);
>  	validate_mm(mm);
> +gather_failed:
>  	return error;
>  }
>
> @@ -2945,24 +2980,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	unsigned long merge_start = addr, merge_end = end;
>  	bool writable_file_mapping = false;
>  	pgoff_t vm_pgoff;
> -	int error;
> +	int error = -ENOMEM;
>  	VMA_ITERATOR(vmi, mm, addr);
> +	unsigned long nr_pages, nr_accounted;
>
> -	/* Check against address space limit. */
> -	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> -		unsigned long nr_pages;
> -
> -		/*
> -		 * MAP_FIXED may remove pages of mappings that intersects with
> -		 * requested mapping. Account for the pages it would unmap.
> -		 */
> -		nr_pages = count_vma_pages_range(mm, addr, end);
> -
> -		if (!may_expand_vm(mm, vm_flags,
> -					(len >> PAGE_SHIFT) - nr_pages))
> -			return -ENOMEM;
> -	}
> +	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
>
> +	/*
> +	 * Check against address space limit.
> +	 * MAP_FIXED may remove pages of mappings that intersects with requested
> +	 * mapping. Account for the pages it would unmap.
> +	 */
> +	if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> +		return -ENOMEM;
>
>  	if (unlikely(!can_modify_mm(mm, addr, end)))
>  		return -EPERM;
> @@ -2979,14 +3009,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
>  		/* Prepare to unmap any existing mapping in the area */
>  		if (vms_gather_munmap_vmas(&vms, &mas_detach))
> -			goto gather_failed;
> -
> -		/* Remove any existing mappings from the vma tree */
> -		if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> -			goto clear_tree_failed;
> +			return -ENOMEM;
>
> -		/* Unmap any existing mapping in the area */
> -		vms_complete_munmap_vmas(&vms, &mas_detach);
>  		next = vms.next;
>  		prev = vms.prev;
>  		vma = NULL;
> @@ -3002,8 +3026,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	 */
>  	if (accountable_mapping(file, vm_flags)) {
>  		charged = len >> PAGE_SHIFT;
> +		charged -= nr_accounted;
>  		if (security_vm_enough_memory_mm(mm, charged))
> -			return -ENOMEM;
> +			goto abort_munmap;
> +		vms.nr_accounted = 0;
>  		vm_flags |= VM_ACCOUNT;
>  	}
>
> @@ -3052,10 +3078,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	 * not unmapped, but the maps are removed from the list.
>  	 */
>  	vma = vm_area_alloc(mm);
> -	if (!vma) {
> -		error = -ENOMEM;
> +	if (!vma)
>  		goto unacct_error;
> -	}
>
>  	vma_iter_config(&vmi, addr, end);
>  	vma_set_range(vma, addr, end, pgoff);
> @@ -3064,6 +3088,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
>  	if (file) {
>  		vma->vm_file = get_file(file);
> +		/* call_mmap() may map PTE, so ensure there are no existing PTEs */
> +		vms_clear_ptes(&vms, &mas_detach, /* mm_wr_locked = */ true);
> +		vms_close_vmas(&vms, &mas_detach);
>  		error = call_mmap(file, vma);
>  		if (error)
>  			goto unmap_and_free_vma;
> @@ -3154,6 +3181,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  expanded:
>  	perf_event_mmap(vma);
>
> +	/* Unmap any existing mapping in the area */
> +	if (vms.nr_pages)
> +		vms_complete_munmap_vmas(&vms, &mas_detach);
> +
>  	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
>  	if (vm_flags & VM_LOCKED) {
>  		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> @@ -3201,14 +3232,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  unacct_error:
>  	if (charged)
>  		vm_unacct_memory(charged);
> -	validate_mm(mm);
> -	return error;
>
> -clear_tree_failed:
> -	abort_munmap_vmas(&mas_detach);
> -gather_failed:
> +abort_munmap:
> +	if (vms.nr_pages)
> +		abort_munmap_vmas(&mas_detach, vms.closed);
>  	validate_mm(mm);
> -	return -ENOMEM;
> +	return error;
>  }
>
>  static int __vm_munmap(unsigned long start, size_t len, bool unlock)
> @@ -3549,7 +3578,7 @@ void exit_mmap(struct mm_struct *mm)
>  	do {
>  		if (vma->vm_flags & VM_ACCOUNT)
>  			nr_accounted += vma_pages(vma);
> -		remove_vma(vma, true);
> +		remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
>  		count++;
>  		cond_resched();
>  		vma = vma_next(&vmi);
> --
> 2.43.0
>

* Re: [PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas()
  2024-07-22 14:25   ` Lorenzo Stoakes
@ 2024-07-23 14:11     ` Liam R. Howlett
  0 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-23 14:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook, Jeff Xu, LEROY Christophe, linuxppc-dev, Dave Hansen

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240722 10:25]:
> On Wed, Jul 17, 2024 at 04:07:05PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > The arch_unmap call was previously moved above the rbtree modifications
> > in commit 5a28fc94c914 ("x86/mpx, mm/core: Fix recursive munmap()
> > corruption").  The move was motivated by an issue with calling
> > arch_unmap() after the rbtree was modified.
> >
> > Since the above commit, mpx was dropped from the kernel in 45fc24e89b7c
> > ("x86/mpx: remove MPX from arch/x86"), so the motivation for calling
> > arch_unmap() prior to modifying the vma tree no longer exists
> > (regardless of rbtree or maple tree implementations).
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Cc: LEROY Christophe <christophe.leroy2@cs-soprasteria.com>
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > ---
> >  mm/mmap.c | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 9f870e715a47..117e8240f697 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2680,6 +2680,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> >  	mm = vms->mm;
> >  	mm->map_count -= vms->vma_count;
> >  	mm->locked_vm -= vms->locked_vm;
> > +	arch_unmap(mm, vms->start, vms->end); /* write lock needed */
> 
> Worth having a mmap_assert_write_locked() here? Would make this
> self-documenting also.

No, this is just to point out it cannot be lowered further in this
function.

> 
> >  	if (vms->unlock)
> >  		mmap_write_downgrade(mm);
> >
> > @@ -2907,7 +2908,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >   *
> >   * This function takes a @mas that is either pointing to the previous VMA or set
> >   * to MA_START and sets it up to remove the mapping(s).  The @len will be
> > - * aligned and any arch_unmap work will be preformed.
> > + * aligned prior to munmap.
> >   *
> >   * Return: 0 on success and drops the lock if so directed, error and leaves the
> >   * lock held otherwise.
> > @@ -2927,16 +2928,12 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> >  		return -EINVAL;
> >
> >  	/*
> > -	 * Check if memory is sealed before arch_unmap.
> >  	 * Prevent unmapping a sealed VMA.
> >  	 * can_modify_mm assumes we have acquired the lock on MM.
> >  	 */
> >  	if (unlikely(!can_modify_mm(mm, start, end)))
> >  		return -EPERM;
> >
> > -	 /* arch_unmap() might do unmaps itself.  */
> > -	arch_unmap(mm, start, end);
> > -
> >  	/* Find the first overlapping VMA */
> >  	vma = vma_find(vmi, end);
> >  	if (!vma) {
> > @@ -2997,9 +2994,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  	if (unlikely(!can_modify_mm(mm, addr, end)))
> >  		return -EPERM;
> >
> > -	 /* arch_unmap() might do unmaps itself.  */
> > -	arch_unmap(mm, addr, end);
> > -
> 
> It seems to me that the intent of this particular invocation was to ensure
> we have done what we can to unmap before trying to unmap ourselves.
> 
> However this seems stupid to me anyway - 'hey maybe the arch will do this
> for us' - yeah probably not.
> 
> So this should definitely go regardless, given we will invoke it later now
> anyway.

This was covered in the commit message: it was because we needed to
remove the VMAs earlier for a dead feature (mpx).

> 
> >  	/* Find the first overlapping VMA */
> >  	vma = vma_find(&vmi, end);
> >  	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > @@ -3377,14 +3371,12 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	struct mm_struct *mm = vma->vm_mm;
> >
> >  	/*
> > -	 * Check if memory is sealed before arch_unmap.
> >  	 * Prevent unmapping a sealed VMA.
> >  	 * can_modify_mm assumes we have acquired the lock on MM.
> >  	 */
> >  	if (unlikely(!can_modify_mm(mm, start, end)))
> >  		return -EPERM;
> >
> > -	arch_unmap(mm, start, end);
> >  	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
> >  }
> >
> > --
> > 2.43.0
> >
> 
> I hope we can find a way to eliminate these kinds of hooks altogether, as
> they reduce our control over how VMA operations are performed.

Agreed.  I see a path forward on doing just that.

> 
> LGTM,
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

* Re: [PATCH v5 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-07-22 18:42   ` Lorenzo Stoakes
@ 2024-07-23 14:15     ` Liam R. Howlett
  0 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-23 14:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook, Jeff Xu

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240722 14:43]:
> On Wed, Jul 17, 2024 at 04:07:02PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Instead of zeroing the vma tree and then overwriting the area, let the
> > area be overwritten and then clean up the gathered vmas using
> > vms_complete_munmap_vmas().
> >
> > If a driver is mapping over an existing vma, then clear the ptes before
> > the call_mmap() invocation.  If the vma has a vm_ops->close(), then call
> > the close() function.  This is done using the vms_clear_ptes() and
> > vms_close_vmas() helpers.  This has the side effect of needing to call
> > open() on the vmas if the mmap_region() fails later on.
> >
> > Temporarily keep track of the number of pages that will be removed and
> > reduce the charged amount.
> >
> > This commit drops the validate_mm() call in the vma_expand() function.
> > It is necessary to drop the validate as it would fail since the mm
> > map_count would be incorrect during a vma expansion, prior to the
> > cleanup from vms_complete_munmap_vmas().
> >
> > Clean up the error handling of vms_gather_munmap_vmas() by calling
> > the verification within the function.
> >
> > Note that before this change, a MAP_FIXED could fail and leave a gap in
> > the vma tree.  With this change, a failed MAP_FIXED will not create a
> > gap; it will leave *a* vma in the area (which may have been split) and
> > attempt to restore them to an operational state (re-attached and
> > vm_ops->open()'ed if they were vm_ops->close()'d).
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> >  mm/internal.h |   2 +
> >  mm/mmap.c     | 119 +++++++++++++++++++++++++++++++-------------------
> >  2 files changed, 76 insertions(+), 45 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index ec8441362c28..5bd60cb9fcbb 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1503,6 +1503,8 @@ struct vma_munmap_struct {
> >  	unsigned long stack_vm;
> >  	unsigned long data_vm;
> >  	bool unlock;			/* Unlock after the munmap */
> > +	bool clear_ptes;		/* If there are outstanding PTE to be cleared */
> > +	bool closed;			/* vma->vm_ops->close() called already */
> >  };
> >
> >  void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 20da0d039c95..0b7aa2c46cec 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -170,10 +170,11 @@ void unlink_file_vma_batch_final(struct unlink_vma_file_batch *vb)
> >  /*
> >   * Close a vm structure and free it.
> >   */
> > -static void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > +static
> > +void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
> >  {
> >  	might_sleep();
> > -	if (vma->vm_ops && vma->vm_ops->close)
> > +	if (!closed && vma->vm_ops && vma->vm_ops->close)
> >  		vma->vm_ops->close(vma);
> >  	if (vma->vm_file)
> >  		fput(vma->vm_file);
> > @@ -401,17 +402,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> >  }
> >
> >  static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > -		unsigned long addr, unsigned long end)
> > +		unsigned long addr, unsigned long end,
> > +		unsigned long *nr_accounted)
> >  {
> >  	VMA_ITERATOR(vmi, mm, addr);
> >  	struct vm_area_struct *vma;
> >  	unsigned long nr_pages = 0;
> >
> > +	*nr_accounted = 0;
> >  	for_each_vma_range(vmi, vma, end) {
> >  		unsigned long vm_start = max(addr, vma->vm_start);
> >  		unsigned long vm_end = min(end, vma->vm_end);
> >
> >  		nr_pages += PHYS_PFN(vm_end - vm_start);
> > +		if (vma->vm_flags & VM_ACCOUNT)
> > +			*nr_accounted += PHYS_PFN(vm_end - vm_start);
> >  	}
> >
> >  	return nr_pages;
> > @@ -527,6 +532,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> >  	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> >  	vms->unmap_start = FIRST_USER_ADDRESS;
> >  	vms->unmap_end = USER_PGTABLES_CEILING;
> > +	vms->clear_ptes = false;	/* No PTEs to clear yet */
> > +	vms->closed = false;
> >  }
> >
> >  /*
> > @@ -735,7 +742,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >  	vma_iter_store(vmi, vma);
> >
> >  	vma_complete(&vp, vmi, vma->vm_mm);
> > -	validate_mm(vma->vm_mm);
> >  	return 0;
> >
> >  nomem:
> > @@ -2597,23 +2603,31 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
> >   *
> >   * Reattach any detached vmas and free up the maple tree used to track the vmas.
> >   */
> > -static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> > +static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> >  {
> >  	struct vm_area_struct *vma;
> >
> >  	mas_set(mas_detach, 0);
> > -	mas_for_each(mas_detach, vma, ULONG_MAX)
> > +	mas_for_each(mas_detach, vma, ULONG_MAX) {
> > +		if (closed && vma->vm_ops && vma->vm_ops->close &&
> > +		    vma->vm_ops->open)
> > +			vma->vm_ops->open(vma);
> > +
> 
> Unfortunately I think this is broken. While in theory this should probably
> be semantically correct, I think drivers get this wrong.
> 
> For instance, in the devio driver, usbdev_vm_ops assigns custom VMA
> open/close functionality in usbdev_vm_open() and usbdev_vm_close().
> 
> usbdev_vm_open() simply increments a 'vma_use_count' counter, whereas
> usbdev_vm_close() calls dec_usb_memory_use_count(), which, if the count
> reaches zero, frees a bunch of objects.
> 
> I've not tested it directly, but it's conceivable that we could end up with
> an entirely broken mapping that might result in a kernel NULL pointer deref
> or some other hideous, possibly exploitable (at least for DoS) scenario.
> 
> Also since this is up to drivers, we can't really control whether people do
> stupid things here or otherwise assume this close/reopen scenario cannot
> happen.
> 
> I think the fact we _might_ cause inconsistent kernel state here rules this
> out as an approach unfortunately.
> 
> We can't simply do what this code did before (that is, leaving a hole) as
> this might require allocations to clear a range in the maple tree (as you
> pointed out in the open VMA scalability call earlier today).
> 
> However, as I suggested in the call, it seems that the case of us
> performing a MAP_FIXED mapping _and_ removing underlying VMAs _and_ those
> VMAs having custom close() handlers is very niche, so in this instance it
> seems sensible to simply preallocate memory to give ourselves a way out,
> clearing the range and returning an error only when this scenario occurs.
> 
> Since this is such an edge case it'll mean we almost never preallocate,
> only doing so in this rare instance.
> 
> It's ugly, but it seems there is no really 'pretty' solution to this
> problem, and we don't want this to block this series!
> 

I will add a patch to this series that will preallocate to leave a gap
(as we do today in the impossible-to-undo failure cases).

Suren suggested I keep the change in its own patch so people see why we
cannot do the Right Thing(tm).

Thanks,
Liam

...

* Re: [PATCH v5 18/21] mm/mmap: Move can_modify_mm() check down the stack
  2024-07-17 20:07 ` [PATCH v5 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
  2024-07-22 14:41   ` Lorenzo Stoakes
@ 2024-07-24 16:29   ` Jeff Xu
  1 sibling, 0 replies; 31+ messages in thread
From: Jeff Xu @ 2024-07-24 16:29 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook

On Wed, Jul 17, 2024 at 1:07 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Without an arch_unmap() call so high in the call stack, the check for
> mseal'ed vmas can be moved lower as well.  This has the benefit of only
> actually checking if things are msealed when there is anything to check.
> That is, we know there is at least one vma that is in the way and needs
> to be checked.
>
> Only call the can_modify_mm() in do_vmi_align_munmap() and the MAP_FIXED
> case of mmap_region().
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Cc: Jeff Xu <jeffxu@chromium.org>
Reviewed-by: Jeff Xu <jeffxu@chromium.org>

> ---
>  mm/mmap.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 117e8240f697..a32f545d3987 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2877,6 +2877,10 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>         struct vma_munmap_struct vms;
>         int error;
>
> +       /* Prevent unmapping a sealed VMA. */
> +       if (unlikely(!can_modify_mm(mm, start, end)))
> +               return -EPERM;
> +
It is nice to consolidate the checks for do_vmi_align_munmap() and
do_vma_munmap() into a single check.

>         init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
>         error = vms_gather_munmap_vmas(&vms, &mas_detach);
>         if (error)
> @@ -2927,13 +2931,6 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>         if (end == start)
>                 return -EINVAL;
>
> -       /*
> -        * Prevent unmapping a sealed VMA.
> -        * can_modify_mm assumes we have acquired the lock on MM.
> -        */
> -       if (unlikely(!can_modify_mm(mm, start, end)))
> -               return -EPERM;
> -
>         /* Find the first overlapping VMA */
>         vma = vma_find(vmi, end);
>         if (!vma) {
> @@ -2991,13 +2988,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>         if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
>                 return -ENOMEM;
>
> -       if (unlikely(!can_modify_mm(mm, addr, end)))
> -               return -EPERM;
>
>         /* Find the first overlapping VMA */
>         vma = vma_find(&vmi, end);
>         init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
>         if (vma) {
> +               /* Prevent unmapping a sealed VMA. */
> +               if (unlikely(!can_modify_mm(mm, addr, end)))
> +                       return -EPERM;
> +
So the optimization here is: when no vma is found in the given address
range, there is no need to call can_modify_mm().

>                 mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
>                 mt_on_stack(mt_detach);
>                 mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> @@ -3370,13 +3369,6 @@ int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  {
>         struct mm_struct *mm = vma->vm_mm;
>
> -       /*
> -        * Prevent unmapping a sealed VMA.
> -        * can_modify_mm assumes we have acquired the lock on MM.
> -        */
> -       if (unlikely(!can_modify_mm(mm, start, end)))
> -               return -EPERM;
> -
>         return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
>  }
>
> --
> 2.43.0
>

* Re: [PATCH v5 16/21] mm/mmap: Use vms accounted pages in mmap_region()
  2024-07-17 20:07 ` [PATCH v5 16/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-07-24 20:23   ` Paul Moore
  0 siblings, 0 replies; 31+ messages in thread
From: Paul Moore @ 2024-07-24 20:23 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	Kees Cook, Jeff Xu, linux-security-module, Lorenzo Stoakes

On Wed, Jul 17, 2024 at 4:11 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Change from the nr_pages variable to vms.nr_accounted for the charged
> pages calculation.  This is necessary for a future patch.
>
> This also avoids checking security_vm_enough_memory_mm() if the amount
> of memory won't change.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Cc: Kees Cook <kees@kernel.org>
> Cc: linux-security-module@vger.kernel.org
> Reviewed-by: Kees Cook <kees@kernel.org>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)

Acked-by: Paul Moore <paul@paul-moore.com> (LSM)

-- 
paul-moore.com

end of thread (newest: 2024-07-24 20:23 UTC)

Thread overview: 31+ messages
2024-07-17 20:06 [PATCH v5 00/21] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 01/21] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 02/21] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 03/21] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 04/21] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 05/21] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 06/21] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 07/21] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 08/21] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 09/21] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
2024-07-17 20:06 ` [PATCH v5 10/21] mm/mmap: Support vma == NULL in init_vma_munmap() Liam R. Howlett
2024-07-22 13:14   ` Lorenzo Stoakes
2024-07-17 20:06 ` [PATCH v5 11/21] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
2024-07-17 20:07 ` [PATCH v5 12/21] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
2024-07-17 20:07 ` [PATCH v5 13/21] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
2024-07-17 20:07 ` [PATCH v5 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
2024-07-22 18:42   ` Lorenzo Stoakes
2024-07-23 14:15     ` Liam R. Howlett
2024-07-17 20:07 ` [PATCH v5 15/21] mm/mmap: Use PHYS_PFN " Liam R. Howlett
2024-07-17 20:07 ` [PATCH v5 16/21] mm/mmap: Use vms accounted pages " Liam R. Howlett
2024-07-24 20:23   ` Paul Moore
2024-07-17 20:07 ` [PATCH v5 17/21] mm/mmap: Relocate arch_unmap() to vms_complete_munmap_vmas() Liam R. Howlett
2024-07-22 14:25   ` Lorenzo Stoakes
2024-07-23 14:11     ` Liam R. Howlett
2024-07-17 20:07 ` [PATCH v5 18/21] mm/mmap: Move can_modify_mm() check down the stack Liam R. Howlett
2024-07-22 14:41   ` Lorenzo Stoakes
2024-07-24 16:29   ` Jeff Xu
2024-07-17 20:07 ` [PATCH v5 19/21] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
2024-07-22 14:58   ` Lorenzo Stoakes
2024-07-17 20:07 ` [PATCH v5 20/21] mm/mmap: Move may_expand_vm() check in mmap_region() Liam R. Howlett
2024-07-17 20:07 ` [PATCH v5 21/21] mm/mmap: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
