* [PATCH v3 00/16] Avoid MAP_FIXED gap exposure
@ 2024-07-04 18:27 Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 01/16] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
` (15 more replies)
0 siblings, 16 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
It is now possible to walk the vma tree using the rcu read locks, and it
is beneficial to do so to reduce lock contention. Doing so while a
MAP_FIXED mapping is executing means that a reader may see a gap in the
vma tree that should never logically exist - and does not when using the
mmap lock in read mode. The temporal gap exists because mmap_region()
calls munmap() prior to installing the new mapping.
This patch set stops rcu readers from seeing the temporal gap by
splitting up the munmap() function into two parts. The first part
prepares the vma tree for modification by doing the necessary splits and
tracking the vmas marked for removal in a side tree. The second part
completes the munmapping of the vmas after the vma tree has been
overwritten (either by a MAP_FIXED replacement vma or by a NULL in the
munmap() case).
Please note that rcu walkers will still be able to see a temporary state
of split vmas that may be in the process of being removed, but the
temporal gap will not be exposed. vma_start_write() is called on both
parts of the split vma, so this state is detectable.
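Below is a rough sketch of the reordered flow that this series builds
towards. It is illustrative only and not a literal excerpt: error handling
is omitted, vmi/vma/addr/end/uf are assumed to be set up as in the callers,
and the helper names match those introduced later in the series.

	struct maple_tree mt_detach;
	MA_STATE(mas_detach, &mt_detach, 0, 0);
	struct vma_munmap_struct vms;

	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
	mt_on_stack(mt_detach);

	/* Part 1: split the boundary vmas and collect the doomed vmas in a
	 * side (detached) tree; the vma tree itself is left untouched.
	 */
	init_vma_munmap(&vms, vmi, vma, addr, end, uf, /* unlock = */ false);
	if (vms_gather_munmap_vmas(&vms, &mas_detach))
		return -ENOMEM;

	/* The replacement vma (MAP_FIXED) or a NULL range (munmap) is
	 * written over the old range here, so rcu readers never observe an
	 * empty range where the old vmas used to be.
	 */

	/* Part 2: clear the page tables, update the statistics and free the
	 * detached vmas.
	 */
	vms_complete_munmap_vmas(&vms, &mas_detach);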
RFC: https://lore.kernel.org/linux-mm/20240531163217.1584450-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20240611180200.711239-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/all/20240625191145.3382793-1-Liam.Howlett@oracle.com/
Changes since v2:
- Introduced the vms_complete_pte_clear() helper, needed in the case of
modules mmapping over a MAP_FIXED area.
- Reduced arguments to unmap_region()
- Fixed potential free_pgtables() range issue on first/last VMA.
Tested with stress-ng --vm-sigv
Liam R. Howlett (16):
mm/mmap: Correctly position vma_iterator in __split_vma()
mm/mmap: Introduce abort_munmap_vmas()
mm/mmap: Introduce vmi_complete_munmap_vmas()
mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
mm/mmap: Introduce vma_munmap_struct for use in munmap operations
mm/mmap: Change munmap to use vma_munmap_struct() for accounting and
surrounding vmas
mm/mmap: Extract validate_mm() from vma_complete()
mm/mmap: Inline munmap operation in mmap_region()
mm/mmap: Expand mmap_region() munmap call
mm/mmap: Reposition vma iterator in mmap_region()
mm/mmap: Track start and end of munmap in vma_munmap_struct
mm/mmap: Clean up unmap_region() argument list
mm/mmap: Avoid zeroing vma tree in mmap_region()
mm/mmap: Use PHYS_PFN in mmap_region()
mm/mmap: Use vms accounted pages in mmap_region()
mm/mmap: Move may_expand_vm() check in mmap_region()
mm/internal.h | 25 +++
mm/mmap.c | 479 +++++++++++++++++++++++++++++++-------------------
2 files changed, 321 insertions(+), 183 deletions(-)
--
2.43.0
* [PATCH v3 01/16] mm/mmap: Correctly position vma_iterator in __split_vma()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
` (14 subsequent siblings)
15 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
The vma iterator may be left pointing to the newly created vma. This
happens when inserting the new vma at the end of the old vma
(!new_below).
The incorrect vma iterator position is not currently exposed, since the
vma iterator is repositioned in the munmap path and is not reused in any
of the other paths.
This has limited impact in the current code, but is required for future
changes.
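As an illustration only (not part of the patch), a caller can now rely on
the iterator position regardless of the split direction:

	error = __split_vma(vmi, vma, addr, new_below);
	if (error)
		return error;
	/* vmi now points at the original, trimmed vma for either value of
	 * new_below, so the walk can continue without repositioning.
	 */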
Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
mm/mmap.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index e42d89f98071..28a46d9ddde0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2414,7 +2414,7 @@ static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
/*
* __split_vma() bypasses sysctl_max_map_count checking. We use this where it
* has already been checked or doesn't make sense to fail.
- * VMA Iterator will point to the end VMA.
+ * VMA Iterator will point to the original vma.
*/
static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long addr, int new_below)
@@ -2483,6 +2483,9 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* Success. */
if (new_below)
vma_next(vmi);
+ else
+ vma_prev(vmi);
+
return 0;
out_free_mpol:
--
2.43.0
* [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 01/16] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 17:02 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
` (13 subsequent siblings)
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
Extract the clean-up of failed munmap() operations from
do_vmi_align_munmap(). This simplifies later patches in the series.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
mm/mmap.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 28a46d9ddde0..d572e1ff8255 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2586,6 +2586,25 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
vma->vm_userfaultfd_ctx, anon_vma_name(vma));
}
+/*
+ * abort_munmap_vmas - Undo any munmap work and free resources
+ *
+ * Reattach detached vmas, free up maple tree used to track the vmas.
+ */
+static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+{
+ struct vm_area_struct *vma;
+ int limit;
+
+ limit = mas_detach->index;
+ mas_set(mas_detach, 0);
+ /* Re-attach any detached VMAs */
+ mas_for_each(mas_detach, vma, limit)
+ vma_mark_detached(vma, false);
+
+ __mt_destroy(mas_detach->tree);
+}
+
/*
* do_vmi_align_munmap() - munmap the aligned region from @start to @end.
* @vmi: The vma iterator
@@ -2740,11 +2759,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
userfaultfd_error:
munmap_gather_failed:
end_split_failed:
- mas_set(&mas_detach, 0);
- mas_for_each(&mas_detach, next, end)
- vma_mark_detached(next, false);
-
- __mt_destroy(&mt_detach);
+ abort_munmap_vmas(&mas_detach);
start_split_failed:
map_count_exceeded:
validate_mm(mm);
--
2.43.0
* [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 01/16] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 17:39 ` Lorenzo Stoakes
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
` (12 subsequent siblings)
15 siblings, 2 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
Extract all of the operations that need to be completed after the vma
maple tree is updated by a munmap() operation. Extracting this makes
the later patch in the series easier to understand.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
mm/mmap.c | 79 +++++++++++++++++++++++++++++++++++++------------------
1 file changed, 53 insertions(+), 26 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index d572e1ff8255..411798f46932 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2605,6 +2605,56 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
__mt_destroy(mas_detach->tree);
}
+/*
+ * vmi_complete_munmap_vmas() - Finish the munmap() operation
+ * @vmi: The vma iterator
+ * @start: The start address
+ * @end: The end address
+ * @unlock: Unlock the mm or not
+ * @mas_detach: them maple state of the detached vma maple tree
+ * @locked_vm: The locked_vm count in the detached vmas
+ *
+ * This function updates the mm_struct, unmaps the region, frees the resources
+ * used for the munmap() and may downgrade the lock - if requested. Everything
+ * needed to be done once the vma maple tree is updated.
+ */
+static void
+vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end, bool unlock, struct ma_state *mas_detach,
+ unsigned long locked_vm)
+{
+ struct vm_area_struct *prev, *next;
+ int count;
+
+ count = mas_detach->index + 1;
+ mm->map_count -= count;
+ mm->locked_vm -= locked_vm;
+ if (unlock)
+ mmap_write_downgrade(mm);
+
+ prev = vma_iter_prev_range(vmi);
+ next = vma_next(vmi);
+ if (next)
+ vma_iter_prev_range(vmi);
+
+ /*
+ * We can free page tables without write-locking mmap_lock because VMAs
+ * were isolated before we downgraded mmap_lock.
+ */
+ mas_set(mas_detach, 1);
+ unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
+ !unlock);
+ /* Statistics and freeing VMAs */
+ mas_set(mas_detach, 0);
+ remove_mt(mm, mas_detach);
+ validate_mm(mm);
+ if (unlock)
+ mmap_read_unlock(mm);
+
+ __mt_destroy(mas_detach->tree);
+}
+
/*
* do_vmi_align_munmap() - munmap the aligned region from @start to @end.
* @vmi: The vma iterator
@@ -2624,7 +2674,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
unsigned long end, struct list_head *uf, bool unlock)
{
- struct vm_area_struct *prev, *next = NULL;
+ struct vm_area_struct *next = NULL;
struct maple_tree mt_detach;
int count = 0;
int error = -ENOMEM;
@@ -2728,31 +2778,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto clear_tree_failed;
/* Point of no return */
- mm->locked_vm -= locked_vm;
- mm->map_count -= count;
- if (unlock)
- mmap_write_downgrade(mm);
-
- prev = vma_iter_prev_range(vmi);
- next = vma_next(vmi);
- if (next)
- vma_iter_prev_range(vmi);
-
- /*
- * We can free page tables without write-locking mmap_lock because VMAs
- * were isolated before we downgraded mmap_lock.
- */
- mas_set(&mas_detach, 1);
- unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
- !unlock);
- /* Statistics and freeing VMAs */
- mas_set(&mas_detach, 0);
- remove_mt(mm, &mas_detach);
- validate_mm(mm);
- if (unlock)
- mmap_read_unlock(mm);
-
- __mt_destroy(&mt_detach);
+ vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
+ locked_vm);
return 0;
clear_tree_failed:
--
2.43.0
* [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (2 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 18:01 ` Lorenzo Stoakes
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
` (11 subsequent siblings)
15 siblings, 2 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
detached maple tree for removal later. Part of the gathering is the
splitting of vmas that span the boundary.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
mm/mmap.c | 82 +++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 58 insertions(+), 24 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 411798f46932..8dc8ffbf9d8d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2656,32 +2656,29 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
}
/*
- * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * for removal at a later date. Handles splitting first and last if necessary
+ * and marking the vmas as isolated.
+ *
* @vmi: The vma iterator
* @vma: The starting vm_area_struct
* @mm: The mm_struct
* @start: The aligned start address to munmap.
* @end: The aligned end address to munmap.
* @uf: The userfaultfd list_head
- * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
- * success.
+ * @mas_detach: The maple state tracking the detached tree
*
- * Return: 0 on success and drops the lock if so directed, error and leaves the
- * lock held otherwise.
+ * Return: 0 on success
*/
static int
-do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
- unsigned long end, struct list_head *uf, bool unlock)
+ unsigned long end, struct list_head *uf,
+ struct ma_state *mas_detach, unsigned long *locked_vm)
{
struct vm_area_struct *next = NULL;
- struct maple_tree mt_detach;
int count = 0;
int error = -ENOMEM;
- unsigned long locked_vm = 0;
- MA_STATE(mas_detach, &mt_detach, 0, 0);
- mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
- mt_on_stack(mt_detach);
/*
* If we need to split any vma, do it now to save pain later.
@@ -2720,15 +2717,14 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto end_split_failed;
}
vma_start_write(next);
- mas_set(&mas_detach, count);
- error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
+ mas_set(mas_detach, count++);
+ if (next->vm_flags & VM_LOCKED)
+ *locked_vm += vma_pages(next);
+
+ error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
goto munmap_gather_failed;
vma_mark_detached(next, true);
- if (next->vm_flags & VM_LOCKED)
- locked_vm += vma_pages(next);
-
- count++;
if (unlikely(uf)) {
/*
* If userfaultfd_unmap_prep returns an error the vmas
@@ -2753,7 +2749,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
{
- MA_STATE(test, &mt_detach, 0, 0);
+ MA_STATE(test, mas_detach->tree, 0, 0);
struct vm_area_struct *vma_mas, *vma_test;
int test_count = 0;
@@ -2773,6 +2769,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
while (vma_iter_addr(vmi) > start)
vma_iter_prev_range(vmi);
+ return 0;
+
+userfaultfd_error:
+munmap_gather_failed:
+end_split_failed:
+ abort_munmap_vmas(mas_detach);
+start_split_failed:
+map_count_exceeded:
+ return error;
+}
+
+/*
+ * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * @vmi: The vma iterator
+ * @vma: The starting vm_area_struct
+ * @mm: The mm_struct
+ * @start: The aligned start address to munmap.
+ * @end: The aligned end address to munmap.
+ * @uf: The userfaultfd list_head
+ * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
+ * success.
+ *
+ * Return: 0 on success and drops the lock if so directed, error and leaves the
+ * lock held otherwise.
+ */
+static int
+do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct list_head *uf, bool unlock)
+{
+ struct maple_tree mt_detach;
+ MA_STATE(mas_detach, &mt_detach, 0, 0);
+ mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+ mt_on_stack(mt_detach);
+ int error;
+ unsigned long locked_vm = 0;
+
+ error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
+ &mas_detach, &locked_vm);
+ if (error)
+ goto gather_failed;
+
error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
if (error)
goto clear_tree_failed;
@@ -2783,12 +2821,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
return 0;
clear_tree_failed:
-userfaultfd_error:
-munmap_gather_failed:
-end_split_failed:
abort_munmap_vmas(&mas_detach);
-start_split_failed:
-map_count_exceeded:
+gather_failed:
validate_mm(mm);
return error;
}
--
2.43.0
* [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (3 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 18:39 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
` (10 subsequent siblings)
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
Use a structure to pass along all the necessary information and counters
involved in removing vmas from the mm_struct.
Update the vmi_ function names to vms_ to indicate the change in the
type of the first argument.
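As an illustration, the interface change condensed from the diff below;
the values passed are whatever the caller already has on hand:

	/* before: a long argument list per helper */
	error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
				       &mas_detach, &locked_vm);

	/* after: one struct carries the state across the helpers */
	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
	error = vms_gather_munmap_vmas(&vms, &mas_detach);
	...
	vms_complete_munmap_vmas(&vms, &mas_detach);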
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/internal.h | 16 ++++++
mm/mmap.c | 137 ++++++++++++++++++++++++++------------------------
2 files changed, 88 insertions(+), 65 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 2ea9a88dcb95..f1e6dea2efcf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1481,6 +1481,22 @@ struct vma_prepare {
struct vm_area_struct *remove2;
};
+/*
+ * vma munmap operation
+ */
+struct vma_munmap_struct {
+ struct vma_iterator *vmi;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma; /* The first vma to munmap */
+ struct list_head *uf; /* Userfaultfd list_head */
+ unsigned long start; /* Aligned start addr */
+ unsigned long end; /* Aligned end addr */
+ int vma_count; /* Number of vmas that will be removed */
+ unsigned long nr_pages; /* Number of pages being removed */
+ unsigned long locked_vm; /* Number of locked pages */
+ bool unlock; /* Unlock after the munmap */
+};
+
void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid);
diff --git a/mm/mmap.c b/mm/mmap.c
index 8dc8ffbf9d8d..76e93146ee9d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -500,6 +500,31 @@ static inline void init_vma_prep(struct vma_prepare *vp,
init_multi_vma_prep(vp, vma, NULL, NULL, NULL);
}
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation. Only unlocked on success
+ */
+static inline void init_vma_munmap(struct vma_munmap_struct *vms,
+ struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, struct list_head *uf,
+ bool unlock)
+{
+ vms->vmi = vmi;
+ vms->vma = vma;
+ vms->mm = vma->vm_mm;
+ vms->start = start;
+ vms->end = end;
+ vms->unlock = unlock;
+ vms->uf = uf;
+ vms->vma_count = 0;
+ vms->nr_pages = vms->locked_vm = 0;
+}
/*
* vma_prepare() - Helper function for handling locking VMAs prior to altering
@@ -2606,78 +2631,59 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
}
/*
- * vmi_complete_munmap_vmas() - Finish the munmap() operation
- * @vmi: The vma iterator
- * @start: The start address
- * @end: The end address
- * @unlock: Unlock the mm or not
- * @mas_detach: them maple state of the detached vma maple tree
- * @locked_vm: The locked_vm count in the detached vmas
- *
- * This function updates the mm_struct, unmaps the region, frees the resources
- * used for the munmap() and may downgrade the lock - if requested. Everything
- * needed to be done once the vma maple tree is updated.
+ * vms_complete_munmap_vmas() - Finish the munmap() operation
+ * @vms: The vma munmap struct
+ * @mas_detach: The maple state of the detached vmas
*/
-static void
-vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
- struct mm_struct *mm, unsigned long start,
- unsigned long end, bool unlock, struct ma_state *mas_detach,
- unsigned long locked_vm)
+
+static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach)
{
struct vm_area_struct *prev, *next;
- int count;
+ struct mm_struct *mm;
- count = mas_detach->index + 1;
- mm->map_count -= count;
- mm->locked_vm -= locked_vm;
- if (unlock)
+ mm = vms->mm;
+ mm->map_count -= vms->vma_count;
+ mm->locked_vm -= vms->locked_vm;
+ if (vms->unlock)
mmap_write_downgrade(mm);
- prev = vma_iter_prev_range(vmi);
- next = vma_next(vmi);
+ prev = vma_iter_prev_range(vms->vmi);
+ next = vma_next(vms->vmi);
if (next)
- vma_iter_prev_range(vmi);
+ vma_iter_prev_range(vms->vmi);
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
*/
mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
- !unlock);
+ unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
+ vms->vma_count, !vms->unlock);
/* Statistics and freeing VMAs */
mas_set(mas_detach, 0);
remove_mt(mm, mas_detach);
validate_mm(mm);
- if (unlock)
+ if (vms->unlock)
mmap_read_unlock(mm);
__mt_destroy(mas_detach->tree);
}
/*
- * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
* for removal at a later date. Handles splitting first and last if necessary
* and marking the vmas as isolated.
*
- * @vmi: The vma iterator
- * @vma: The starting vm_area_struct
- * @mm: The mm_struct
- * @start: The aligned start address to munmap.
- * @end: The aligned end address to munmap.
- * @uf: The userfaultfd list_head
+ * @vms: The vma munmap struct
* @mas_detach: The maple state tracking the detached tree
*
* Return: 0 on success
*/
-static int
-vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
- struct mm_struct *mm, unsigned long start,
- unsigned long end, struct list_head *uf,
- struct ma_state *mas_detach, unsigned long *locked_vm)
+static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach)
{
struct vm_area_struct *next = NULL;
- int count = 0;
int error = -ENOMEM;
/*
@@ -2689,17 +2695,18 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
*/
/* Does it split the first one? */
- if (start > vma->vm_start) {
+ if (vms->start > vms->vma->vm_start) {
/*
* Make sure that map_count on return from munmap() will
* not exceed its limit; but let map_count go just above
* its limit temporarily, to help free resources as expected.
*/
- if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+ if (vms->end < vms->vma->vm_end &&
+ vms->mm->map_count >= sysctl_max_map_count)
goto map_count_exceeded;
- error = __split_vma(vmi, vma, start, 1);
+ error = __split_vma(vms->vmi, vms->vma, vms->start, 1);
if (error)
goto start_split_failed;
}
@@ -2708,24 +2715,24 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- next = vma;
+ next = vms->vma;
do {
/* Does it split the end? */
- if (next->vm_end > end) {
- error = __split_vma(vmi, next, end, 0);
+ if (next->vm_end > vms->end) {
+ error = __split_vma(vms->vmi, next, vms->end, 0);
if (error)
goto end_split_failed;
}
vma_start_write(next);
- mas_set(mas_detach, count++);
+ mas_set(mas_detach, vms->vma_count++);
if (next->vm_flags & VM_LOCKED)
- *locked_vm += vma_pages(next);
+ vms->locked_vm += vma_pages(next);
error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
goto munmap_gather_failed;
vma_mark_detached(next, true);
- if (unlikely(uf)) {
+ if (unlikely(vms->uf)) {
/*
* If userfaultfd_unmap_prep returns an error the vmas
* will remain split, but userland will get a
@@ -2735,16 +2742,17 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
* split, despite we could. This is unlikely enough
* failure that it's not worth optimizing it for.
*/
- error = userfaultfd_unmap_prep(next, start, end, uf);
+ error = userfaultfd_unmap_prep(next, vms->start,
+ vms->end, vms->uf);
if (error)
goto userfaultfd_error;
}
#ifdef CONFIG_DEBUG_VM_MAPLE_TREE
- BUG_ON(next->vm_start < start);
- BUG_ON(next->vm_start > end);
+ BUG_ON(next->vm_start < vms->start);
+ BUG_ON(next->vm_start > vms->end);
#endif
- } for_each_vma_range(*vmi, next, end);
+ } for_each_vma_range(*(vms->vmi), next, vms->end);
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
@@ -2753,21 +2761,21 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct vm_area_struct *vma_mas, *vma_test;
int test_count = 0;
- vma_iter_set(vmi, start);
+ vma_iter_set(vms->vmi, vms->start);
rcu_read_lock();
- vma_test = mas_find(&test, count - 1);
- for_each_vma_range(*vmi, vma_mas, end) {
+ vma_test = mas_find(&test, vms->vma_count - 1);
+ for_each_vma_range(*(vms->vmi), vma_mas, vms->end) {
BUG_ON(vma_mas != vma_test);
test_count++;
- vma_test = mas_next(&test, count - 1);
+ vma_test = mas_next(&test, vms->vma_count - 1);
}
rcu_read_unlock();
- BUG_ON(count != test_count);
+ BUG_ON(vms->vma_count != test_count);
}
#endif
- while (vma_iter_addr(vmi) > start)
- vma_iter_prev_range(vmi);
+ while (vma_iter_addr(vms->vmi) > vms->start)
+ vma_iter_prev_range(vms->vmi);
return 0;
@@ -2803,11 +2811,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
MA_STATE(mas_detach, &mt_detach, 0, 0);
mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
mt_on_stack(mt_detach);
+ struct vma_munmap_struct vms;
int error;
- unsigned long locked_vm = 0;
- error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
- &mas_detach, &locked_vm);
+ init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
+ error = vms_gather_munmap_vmas(&vms, &mas_detach);
if (error)
goto gather_failed;
@@ -2816,8 +2824,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto clear_tree_failed;
/* Point of no return */
- vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
- locked_vm);
+ vms_complete_munmap_vmas(&vms, &mas_detach);
return 0;
clear_tree_failed:
--
2.43.0
* [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (4 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 19:27 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 07/16] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
` (9 subsequent siblings)
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
Clean up the code by changing the munmap operation to use a structure
for the accounting and munmap variables.
Since remove_mt() is only called in one location and its contents will
be reduced to almost nothing, the remains of the function can be added
to vms_complete_munmap_vmas().
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
mm/internal.h | 6 ++++
mm/mmap.c | 81 ++++++++++++++++++++++++++-------------------------
2 files changed, 47 insertions(+), 40 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index f1e6dea2efcf..8cbbbe7d40f3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1488,12 +1488,18 @@ struct vma_munmap_struct {
struct vma_iterator *vmi;
struct mm_struct *mm;
struct vm_area_struct *vma; /* The first vma to munmap */
+ struct vm_area_struct *next; /* vma after the munmap area */
+ struct vm_area_struct *prev; /* vma before the munmap area */
struct list_head *uf; /* Userfaultfd list_head */
unsigned long start; /* Aligned start addr */
unsigned long end; /* Aligned end addr */
int vma_count; /* Number of vmas that will be removed */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
+ unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
+ unsigned long exec_vm;
+ unsigned long stack_vm;
+ unsigned long data_vm;
bool unlock; /* Unlock after the munmap */
};
diff --git a/mm/mmap.c b/mm/mmap.c
index 76e93146ee9d..2a1a49f98fa3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -523,7 +523,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->unlock = unlock;
vms->uf = uf;
vms->vma_count = 0;
- vms->nr_pages = vms->locked_vm = 0;
+ vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+ vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
}
/*
@@ -2388,30 +2389,6 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
return vma;
}
-/*
- * Ok - we have the memory areas we should free on a maple tree so release them,
- * and do the vma updates.
- *
- * Called with the mm semaphore held.
- */
-static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
-{
- unsigned long nr_accounted = 0;
- struct vm_area_struct *vma;
-
- /* Update high watermark before we lower total_vm */
- update_hiwater_vm(mm);
- mas_for_each(mas, vma, ULONG_MAX) {
- long nrpages = vma_pages(vma);
-
- if (vma->vm_flags & VM_ACCOUNT)
- nr_accounted += nrpages;
- vm_stat_account(mm, vma->vm_flags, -nrpages);
- remove_vma(vma, false);
- }
- vm_unacct_memory(nr_accounted);
-}
-
/*
* Get rid of page table information in the indicated region.
*
@@ -2634,12 +2611,15 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
* vms_complete_munmap_vmas() - Finish the munmap() operation
* @vms: The vma munmap struct
* @mas_detach: The maple state of the detached vmas
+ *
+ * This function updates the mm_struct, unmaps the region, frees the resources
+ * used for the munmap() and may downgrade the lock - if requested. Everything
+ * needed to be done once the vma maple tree is updated.
*/
-
static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach)
{
- struct vm_area_struct *prev, *next;
+ struct vm_area_struct *vma;
struct mm_struct *mm;
mm = vms->mm;
@@ -2648,21 +2628,26 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->unlock)
mmap_write_downgrade(mm);
- prev = vma_iter_prev_range(vms->vmi);
- next = vma_next(vms->vmi);
- if (next)
- vma_iter_prev_range(vms->vmi);
-
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
*/
mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
- vms->vma_count, !vms->unlock);
- /* Statistics and freeing VMAs */
+ unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
+ vms->start, vms->end, vms->vma_count, !vms->unlock);
+ /* Update high watermark before we lower total_vm */
+ update_hiwater_vm(mm);
+ /* Stat accounting */
+ WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
+ mm->exec_vm -= vms->exec_vm;
+ mm->stack_vm -= vms->stack_vm;
+ mm->data_vm -= vms->data_vm;
+ /* Remove and clean up vmas */
mas_set(mas_detach, 0);
- remove_mt(mm, mas_detach);
+ mas_for_each(mas_detach, vma, ULONG_MAX)
+ remove_vma(vma, false);
+
+ vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
if (vms->unlock)
mmap_read_unlock(mm);
@@ -2710,13 +2695,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto start_split_failed;
}
+ vms->prev = vma_prev(vms->vmi);
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- next = vms->vma;
- do {
+ for_each_vma_range(*(vms->vmi), next, vms->end) {
+ long nrpages;
/* Does it split the end? */
if (next->vm_end > vms->end) {
error = __split_vma(vms->vmi, next, vms->end, 0);
@@ -2725,8 +2711,21 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
}
vma_start_write(next);
mas_set(mas_detach, vms->vma_count++);
+ nrpages = vma_pages(next);
+
+ vms->nr_pages += nrpages;
if (next->vm_flags & VM_LOCKED)
- vms->locked_vm += vma_pages(next);
+ vms->locked_vm += nrpages;
+
+ if (next->vm_flags & VM_ACCOUNT)
+ vms->nr_accounted += nrpages;
+
+ if (is_exec_mapping(next->vm_flags))
+ vms->exec_vm += nrpages;
+ else if (is_stack_mapping(next->vm_flags))
+ vms->stack_vm += nrpages;
+ else if (is_data_mapping(next->vm_flags))
+ vms->data_vm += nrpages;
error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
@@ -2752,7 +2751,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
BUG_ON(next->vm_start < vms->start);
BUG_ON(next->vm_start > vms->end);
#endif
- } for_each_vma_range(*(vms->vmi), next, vms->end);
+ }
+
+ vms->next = vma_next(vms->vmi);
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
--
2.43.0
* [PATCH v3 07/16] mm/mmap: Extract validate_mm() from vma_complete()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (5 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 19:35 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
` (8 subsequent siblings)
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
vma_complete() will need to be called at a time when it is unsafe to call
validate_mm(). Extract the validate_mm() call out to all the callers now
so that only one location needs to be modified in the next change.
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
mm/mmap.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 2a1a49f98fa3..8d9be791997a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -646,7 +646,6 @@ static inline void vma_complete(struct vma_prepare *vp,
}
if (vp->insert && vp->file)
uprobe_mmap(vp->insert);
- validate_mm(mm);
}
/*
@@ -734,6 +733,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
return 0;
nomem:
@@ -775,6 +775,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_clear(vmi);
vma_set_range(vma, start, end, pgoff);
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
return 0;
}
@@ -1103,6 +1104,7 @@ static struct vm_area_struct
}
vma_complete(&vp, vmi, mm);
+ validate_mm(mm);
khugepaged_enter_vma(res, vm_flags);
return res;
@@ -2481,6 +2483,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* vma_complete stores the new vma */
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
/* Success. */
if (new_below)
@@ -3353,6 +3356,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, mm);
+ validate_mm(mm);
khugepaged_enter_vma(vma, flags);
goto out;
}
--
2.43.0
* [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (6 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 07/16] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 19:39 ` Lorenzo Stoakes
2024-07-10 16:15 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
` (7 subsequent siblings)
15 siblings, 2 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
mmap_region() is already passed a sanitized addr and len, so change the
call from do_vmi_munmap() to do_vmi_align_munmap() and inline the other
checks.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 8d9be791997a..e9858ca8bbd4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2937,12 +2937,20 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return -ENOMEM;
}
- /* Unmap any existing mapping in the area */
- error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
- if (error == -EPERM)
- return error;
- else if (error)
- return -ENOMEM;
+
+ if (unlikely(!can_modify_mm(mm, addr, end)))
+ return -EPERM;
+
+ /* arch_unmap() might do unmaps itself. */
+ arch_unmap(mm, addr, end);
+
+ /* Find the first overlapping VMA */
+ vma = vma_find(&vmi, end);
+ if (vma) {
+ if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+ return -ENOMEM;
+ vma = NULL;
+ }
/*
* Private writable mapping: check memory availability
--
2.43.0
* [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (7 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 20:06 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
` (6 subsequent siblings)
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Open code the do_vmi_align_munmap() call so that it can be broken up
later in the series.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index e9858ca8bbd4..f5b33de4e717 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2915,6 +2915,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct vm_area_struct *next, *prev, *merge;
pgoff_t pglen = len >> PAGE_SHIFT;
unsigned long charged = 0;
+ struct vma_munmap_struct vms;
+ struct ma_state mas_detach;
+ struct maple_tree mt_detach;
unsigned long end = addr + len;
unsigned long merge_start = addr, merge_end = end;
bool writable_file_mapping = false;
@@ -2947,9 +2950,24 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
if (vma) {
- if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+ mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+ mt_on_stack(mt_detach);
+ mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+ init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
+ if (vms_gather_munmap_vmas(&vms, &mas_detach))
+ return -ENOMEM;
+
+ if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
return -ENOMEM;
+
+ vms_complete_munmap_vmas(&vms, &mas_detach);
+ next = vms.next;
+ prev = vms.prev;
+ vma_prev(&vmi);
vma = NULL;
+ } else {
+ next = vma_next(&vmi);
+ prev = vma_prev(&vmi);
}
/*
@@ -2962,8 +2980,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}
- next = vma_next(&vmi);
- prev = vma_prev(&vmi);
if (vm_flags & VM_SPECIAL) {
if (prev)
vma_iter_next_range(&vmi);
--
2.43.0
* [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (8 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 20:18 ` Lorenzo Stoakes
2024-07-10 16:48 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
` (5 subsequent siblings)
15 siblings, 2 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of moving (or leaving) the vma iterator pointing at the previous
vma, leave it pointing at the insert location. Pointing the vma
iterator at the insert location allows for a cleaner walk of the vma
tree for the MAP_FIXED and no-expansion cases.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index f5b33de4e717..ecf55d32e804 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2963,11 +2963,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vms_complete_munmap_vmas(&vms, &mas_detach);
next = vms.next;
prev = vms.prev;
- vma_prev(&vmi);
vma = NULL;
} else {
next = vma_next(&vmi);
prev = vma_prev(&vmi);
+ if (prev)
+ vma_iter_next_range(&vmi);
}
/*
@@ -2980,11 +2981,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}
- if (vm_flags & VM_SPECIAL) {
- if (prev)
- vma_iter_next_range(&vmi);
+ if (vm_flags & VM_SPECIAL)
goto cannot_expand;
- }
/* Attempt to expand an old mapping */
/* Check next */
@@ -3005,19 +3003,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
merge_start = prev->vm_start;
vma = prev;
vm_pgoff = prev->vm_pgoff;
- } else if (prev) {
- vma_iter_next_range(&vmi);
+ vma_prev(&vmi);
}
- /* Actually expand, if possible */
- if (vma &&
- !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
- khugepaged_enter_vma(vma, vm_flags);
- goto expanded;
+ if (vma) {
+ /* Actually expand, if possible */
+ if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
+ khugepaged_enter_vma(vma, vm_flags);
+ goto expanded;
+ }
+
+ /* If the expand fails, then reposition the vma iterator */
+ if (unlikely(vma == prev))
+ vma_iter_set(&vmi, addr);
}
- if (vma == prev)
- vma_iter_set(&vmi, addr);
cannot_expand:
/*
--
2.43.0
* [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (9 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 20:27 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 12/16] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
` (4 subsequent siblings)
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Set the start and end addresses for the munmap when the prev and next
vmas are gathered. This is needed to avoid incorrect addresses being used
during vms_complete_munmap_vmas() if the prev/next vmas are expanded.
Add a new helper vms_complete_pte_clear(), which is needed later and
will avoid growing the argument list to unmap_region() beyond the 9 it
already has.
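Condensed from the diff below, for illustration only: the page-table range
to free is now captured while gathering, so it stays correct even if
prev/next are later expanded.

	/* in vms_gather_munmap_vmas() */
	vms->prev = vma_prev(vms->vmi);
	if (vms->prev)
		vms->unmap_start = vms->prev->vm_end;
	...
	vms->next = vma_next(vms->vmi);
	if (vms->next)
		vms->unmap_end = vms->next->vm_start;

	/* in vms_complete_pte_clear() */
	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
		      vms->unmap_end, mm_wr_locked);

unmap_start/unmap_end default to FIRST_USER_ADDRESS/USER_PGTABLES_CEILING
in init_vma_munmap() when there is no prev/next.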
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/internal.h | 2 ++
mm/mmap.c | 34 +++++++++++++++++++++++++++-------
2 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 8cbbbe7d40f3..4c9f06669cc4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1493,6 +1493,8 @@ struct vma_munmap_struct {
struct list_head *uf; /* Userfaultfd list_head */
unsigned long start; /* Aligned start addr */
unsigned long end; /* Aligned end addr */
+ unsigned long unmap_start;
+ unsigned long unmap_end;
int vma_count; /* Number of vmas that will be removed */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
diff --git a/mm/mmap.c b/mm/mmap.c
index ecf55d32e804..45443a53be76 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -525,6 +525,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->vma_count = 0;
vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+ vms->unmap_start = FIRST_USER_ADDRESS;
+ vms->unmap_end = USER_PGTABLES_CEILING;
}
/*
@@ -2610,6 +2612,26 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
__mt_destroy(mas_detach->tree);
}
+
+static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach, bool mm_wr_locked)
+{
+ struct mmu_gather tlb;
+
+ /*
+ * We can free page tables without write-locking mmap_lock because VMAs
+ * were isolated before we downgraded mmap_lock.
+ */
+ mas_set(mas_detach, 1);
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, vms->mm);
+ update_hiwater_rss(vms->mm);
+ unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
+ mas_set(mas_detach, 1);
+ free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
+ tlb_finish_mmu(&tlb);
+}
+
/*
* vms_complete_munmap_vmas() - Finish the munmap() operation
* @vms: The vma munmap struct
@@ -2631,13 +2653,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->unlock)
mmap_write_downgrade(mm);
- /*
- * We can free page tables without write-locking mmap_lock because VMAs
- * were isolated before we downgraded mmap_lock.
- */
- mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
- vms->start, vms->end, vms->vma_count, !vms->unlock);
+ vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
/* Update high watermark before we lower total_vm */
update_hiwater_vm(mm);
/* Stat accounting */
@@ -2699,6 +2715,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
goto start_split_failed;
}
vms->prev = vma_prev(vms->vmi);
+ if (vms->prev)
+ vms->unmap_start = vms->prev->vm_end;
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
@@ -2757,6 +2775,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
}
vms->next = vma_next(vms->vmi);
+ if (vms->next)
+ vms->unmap_end = vms->next->vm_start;
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
--
2.43.0
* [PATCH v3 12/16] mm/mmap: Clean up unmap_region() argument list
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (10 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-05 20:33 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
` (3 subsequent siblings)
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
With the only caller of unmap_region() being the error path of
mmap_region(), the argument list can be significantly reduced.
There is also no need to forward declare the static function any
longer.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 25 +++++++++----------------
1 file changed, 9 insertions(+), 16 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 45443a53be76..5d458c5f080e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -76,11 +76,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
static bool ignore_rlimit_data;
core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
-static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
- struct vm_area_struct *vma, struct vm_area_struct *prev,
- struct vm_area_struct *next, unsigned long start,
- unsigned long end, unsigned long tree_end, bool mm_wr_locked);
-
static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
{
return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
@@ -2398,22 +2393,21 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
*
* Called with the mm semaphore held.
*/
-static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
- struct vm_area_struct *vma, struct vm_area_struct *prev,
- struct vm_area_struct *next, unsigned long start,
- unsigned long end, unsigned long tree_end, bool mm_wr_locked)
+static void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+ struct vm_area_struct *prev, struct vm_area_struct *next)
{
+ struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long mt_start = mas->index;
lru_add_drain();
tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
- mas_set(mas, mt_start);
+ unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
+ /* mm_wr_locked = */ true);
+ mas_set(mas, vma->vm_end);
free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
- next ? next->vm_start : USER_PGTABLES_CEILING,
- mm_wr_locked);
+ next ? next->vm_start : USER_PGTABLES_CEILING,
+ /* mm_wr_locked = */ true);
tlb_finish_mmu(&tlb);
}
@@ -3186,8 +3180,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_iter_set(&vmi, vma->vm_end);
/* Undo any partial mapping done by a device driver. */
- unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
- vma->vm_end, vma->vm_end, true);
+ unmap_region(&vmi.mas, vma, prev, next);
}
if (writable_file_mapping)
mapping_unmap_writable(file->f_mapping);
--
2.43.0
* [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (11 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 12/16] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-08 12:18 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 14/16] mm/mmap: Use PHYS_PFN " Liam R. Howlett
` (2 subsequent siblings)
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of zeroing the vma tree and then overwriting the area, let the
area be overwritten and then clean up the gathered vmas using
vms_complete_munmap_vmas().
In the case of a driver mapping over existing vmas, the PTEs are cleared
using the helper vms_complete_pte_clear().
Temporarily keep track of the number of pages that will be removed and
reduce the charged amount.
This also drops the validate_mm() call in the vma_expand() function. It
is necessary to drop the validation because it would fail: the mm
map_count would be incorrect during a vma expansion, prior to the cleanup
done by vms_complete_munmap_vmas().
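For illustration, the MAP_FIXED path in mmap_region() after this patch,
heavily condensed from the diff below (error handling omitted; the calls
not shown in the diff come from the existing mmap_region() code):

	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
	...
	if (vma_find(&vmi, end)) {
		/* gather only - the vma tree is no longer cleared here */
		vms_gather_munmap_vmas(&vms, &mas_detach);
	}
	...
	charged = (len >> PAGE_SHIFT) - nr_accounted;	/* reuse the old charge */
	...
	if (file && vms.nr_pages)	/* call_mmap() needs the old PTEs gone */
		vms_complete_pte_clear(&vms, &mas_detach, true);
	...
	vma_iter_store(&vmi, vma);	/* the new vma overwrites the old range */
	...
	if (vms.nr_pages)		/* free the old vmas only at the end */
		vms_complete_munmap_vmas(&vms, &mas_detach);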
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/internal.h | 1 +
mm/mmap.c | 61 ++++++++++++++++++++++++++++++---------------------
2 files changed, 37 insertions(+), 25 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 4c9f06669cc4..fae4a1bba732 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
unsigned long stack_vm;
unsigned long data_vm;
bool unlock; /* Unlock after the munmap */
+ bool cleared_ptes; /* If the PTE are cleared already */
};
void __meminit __init_single_page(struct page *page, unsigned long pfn,
diff --git a/mm/mmap.c b/mm/mmap.c
index 5d458c5f080e..0c334eeae8cd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -401,17 +401,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
}
static unsigned long count_vma_pages_range(struct mm_struct *mm,
- unsigned long addr, unsigned long end)
+ unsigned long addr, unsigned long end,
+ unsigned long *nr_accounted)
{
VMA_ITERATOR(vmi, mm, addr);
struct vm_area_struct *vma;
unsigned long nr_pages = 0;
+ *nr_accounted = 0;
for_each_vma_range(vmi, vma, end) {
unsigned long vm_start = max(addr, vma->vm_start);
unsigned long vm_end = min(end, vma->vm_end);
nr_pages += PHYS_PFN(vm_end - vm_start);
+ if (vma->vm_flags & VM_ACCOUNT)
+ *nr_accounted += PHYS_PFN(vm_end - vm_start);
}
return nr_pages;
@@ -522,6 +526,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
vms->unmap_start = FIRST_USER_ADDRESS;
vms->unmap_end = USER_PGTABLES_CEILING;
+ vms->cleared_ptes = false;
}
/*
@@ -730,7 +735,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, vma->vm_mm);
- validate_mm(vma->vm_mm);
return 0;
nomem:
@@ -2612,6 +2616,9 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
{
struct mmu_gather tlb;
+ if (vms->cleared_ptes)
+ return;
+
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
@@ -2624,6 +2631,7 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
mas_set(mas_detach, 1);
free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
tlb_finish_mmu(&tlb);
+ vms->cleared_ptes = true;
}
/*
@@ -2936,24 +2944,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long merge_start = addr, merge_end = end;
bool writable_file_mapping = false;
pgoff_t vm_pgoff;
- int error;
+ int error = -ENOMEM;
VMA_ITERATOR(vmi, mm, addr);
+ unsigned long nr_pages, nr_accounted;
- /* Check against address space limit. */
- if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
- unsigned long nr_pages;
-
- /*
- * MAP_FIXED may remove pages of mappings that intersects with
- * requested mapping. Account for the pages it would unmap.
- */
- nr_pages = count_vma_pages_range(mm, addr, end);
-
- if (!may_expand_vm(mm, vm_flags,
- (len >> PAGE_SHIFT) - nr_pages))
- return -ENOMEM;
- }
+ nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
+ /* Check against address space limit. */
+ /*
+ * MAP_FIXED may remove pages of mappings that intersects with requested
+ * mapping. Account for the pages it would unmap.
+ */
+ if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+ return -ENOMEM;
if (unlikely(!can_modify_mm(mm, addr, end)))
return -EPERM;
@@ -2971,14 +2974,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (vms_gather_munmap_vmas(&vms, &mas_detach))
return -ENOMEM;
- if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
- return -ENOMEM;
-
- vms_complete_munmap_vmas(&vms, &mas_detach);
next = vms.next;
prev = vms.prev;
vma = NULL;
} else {
+ /* Minimal setup of vms */
+ vms.nr_pages = 0;
next = vma_next(&vmi);
prev = vma_prev(&vmi);
if (prev)
@@ -2990,8 +2991,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*/
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
+ charged -= nr_accounted;
if (security_vm_enough_memory_mm(mm, charged))
- return -ENOMEM;
+ goto abort_munmap;
+ vms.nr_accounted = 0;
vm_flags |= VM_ACCOUNT;
}
@@ -3040,10 +3043,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* not unmapped, but the maps are removed from the list.
*/
vma = vm_area_alloc(mm);
- if (!vma) {
- error = -ENOMEM;
+ if (!vma)
goto unacct_error;
- }
vma_iter_config(&vmi, addr, end);
vma_set_range(vma, addr, end, pgoff);
@@ -3052,6 +3053,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (file) {
vma->vm_file = get_file(file);
+ /* call_mmap() may populate PTEs, so ensure there are no existing PTEs */
+ if (vms.nr_pages)
+ vms_complete_pte_clear(&vms, &mas_detach, true);
error = call_mmap(file, vma);
if (error)
goto unmap_and_free_vma;
@@ -3142,6 +3146,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
expanded:
perf_event_mmap(vma);
+ if (vms.nr_pages)
+ vms_complete_munmap_vmas(&vms, &mas_detach);
+
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
@@ -3189,6 +3196,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unacct_error:
if (charged)
vm_unacct_memory(charged);
+
+abort_munmap:
+ if (vms.nr_pages)
+ abort_munmap_vmas(&mas_detach);
validate_mm(mm);
return error;
}
--
2.43.0
^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v3 14/16] mm/mmap: Use PHYS_PFN in mmap_region()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (12 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-08 12:21 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 15/16] mm/mmap: Use vms accounted pages " Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check " Liam R. Howlett
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of shifting the length by PAGE_SHIFT, use PHYS_PFN. Also use the
existing local variable consistently instead of only some of the time.
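For reference, PHYS_PFN() is (at the time of writing) just the same shift,
defined in include/linux/pfn.h as roughly:

    #define PHYS_PFN(x)    ((unsigned long)((x) >> PAGE_SHIFT))

so pglen = PHYS_PFN(len) is equivalent to len >> PAGE_SHIFT and the change
is purely cosmetic.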
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 0c334eeae8cd..b14da6bd257f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2935,7 +2935,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
struct vm_area_struct *next, *prev, *merge;
- pgoff_t pglen = len >> PAGE_SHIFT;
+ pgoff_t pglen = PHYS_PFN(len);
unsigned long charged = 0;
struct vma_munmap_struct vms;
struct ma_state mas_detach;
@@ -2955,7 +2955,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* MAP_FIXED may remove pages of mappings that intersects with requested
* mapping. Account for the pages it would unmap.
*/
- if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+ if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
return -ENOMEM;
if (unlikely(!can_modify_mm(mm, addr, end)))
@@ -2990,7 +2990,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* Private writable mapping: check memory availability
*/
if (accountable_mapping(file, vm_flags)) {
- charged = len >> PAGE_SHIFT;
+ charged = pglen;
charged -= nr_accounted;
if (security_vm_enough_memory_mm(mm, charged))
goto abort_munmap;
@@ -3149,14 +3149,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (vms.nr_pages)
vms_complete_munmap_vmas(&vms, &mas_detach);
- vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
+ vm_stat_account(mm, vm_flags, pglen);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
vm_flags_clear(vma, VM_LOCKED_MASK);
else
- mm->locked_vm += (len >> PAGE_SHIFT);
+ mm->locked_vm += pglen;
}
if (file)
--
2.43.0
^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v3 15/16] mm/mmap: Use vms accounted pages in mmap_region()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (13 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 14/16] mm/mmap: Use PHYS_PFN " Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-08 12:43 ` Lorenzo Stoakes
2024-07-04 18:27 ` [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check " Liam R. Howlett
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Change from the nr_pages variable to vms.nr_accounted for the charged
pages calculation. This is necessary for a future patch.
This also avoids checking security_vm_enough_memory_mm() if the amount
of memory won't change.
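For example (hypothetical numbers): an accountable 100-page MAP_FIXED
mapping placed over 100 pages that are already VM_ACCOUNT'ed ends up with
charged == 0, so security_vm_enough_memory_mm() is not called at all; if
only 40 of the replaced pages were accounted, just the remaining 60 pages
are charged.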
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Kees Cook <kees@kernel.org>
---
mm/mmap.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index b14da6bd257f..b2de26683903 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2980,6 +2980,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
} else {
/* Minimal setup of vms */
vms.nr_pages = 0;
+ vms.nr_accounted = 0;
next = vma_next(&vmi);
prev = vma_prev(&vmi);
if (prev)
@@ -2991,9 +2992,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*/
if (accountable_mapping(file, vm_flags)) {
charged = pglen;
- charged -= nr_accounted;
- if (security_vm_enough_memory_mm(mm, charged))
+ charged -= vms.nr_accounted;
+ if (charged && security_vm_enough_memory_mm(mm, charged))
goto abort_munmap;
+
vms.nr_accounted = 0;
vm_flags |= VM_ACCOUNT;
}
--
2.43.0
^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (14 preceding siblings ...)
2024-07-04 18:27 ` [PATCH v3 15/16] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-07-04 18:27 ` Liam R. Howlett
2024-07-08 12:52 ` Lorenzo Stoakes
15 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:27 UTC (permalink / raw)
To: linux-mm, Andrew Morton
Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
call, so use it instead of looping over the vmas twice.
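For example (hypothetical numbers): a 100-page MAP_FIXED request over a
range that already contains 40 mapped pages only grows the address space
by 60 pages, so may_expand_vm() is now asked about pglen - vms.nr_pages =
60 rather than the full 100.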
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 36 ++++--------------------------------
1 file changed, 4 insertions(+), 32 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index b2de26683903..62edaabf3987 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
}
-static unsigned long count_vma_pages_range(struct mm_struct *mm,
- unsigned long addr, unsigned long end,
- unsigned long *nr_accounted)
-{
- VMA_ITERATOR(vmi, mm, addr);
- struct vm_area_struct *vma;
- unsigned long nr_pages = 0;
-
- *nr_accounted = 0;
- for_each_vma_range(vmi, vma, end) {
- unsigned long vm_start = max(addr, vma->vm_start);
- unsigned long vm_end = min(end, vma->vm_end);
-
- nr_pages += PHYS_PFN(vm_end - vm_start);
- if (vma->vm_flags & VM_ACCOUNT)
- *nr_accounted += PHYS_PFN(vm_end - vm_start);
- }
-
- return nr_pages;
-}
-
static void __vma_link_file(struct vm_area_struct *vma,
struct address_space *mapping)
{
@@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
pgoff_t vm_pgoff;
int error = -ENOMEM;
VMA_ITERATOR(vmi, mm, addr);
- unsigned long nr_pages, nr_accounted;
-
- nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
-
- /* Check against address space limit. */
- /*
- * MAP_FIXED may remove pages of mappings that intersects with requested
- * mapping. Account for the pages it would unmap.
- */
- if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
- return -ENOMEM;
if (unlikely(!can_modify_mm(mm, addr, end)))
return -EPERM;
@@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_iter_next_range(&vmi);
}
+ /* Check against address space limit. */
+ if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
+ goto abort_munmap;
+
/*
* Private writable mapping: check memory availability
*/
--
2.43.0
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas()
2024-07-04 18:27 ` [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
@ 2024-07-05 17:02 ` Lorenzo Stoakes
2024-07-05 18:12 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 17:02 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 04, 2024 at 02:27:04PM GMT, Liam R. Howlett wrote:
> Extract clean up of failed munmap() operations from
> do_vmi_align_munmap(). This simplifies later patches in the series.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/mmap.c | 25 ++++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 28a46d9ddde0..d572e1ff8255 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2586,6 +2586,25 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
> vma->vm_userfaultfd_ctx, anon_vma_name(vma));
> }
>
> +/*
> + * abort_munmap_vmas - Undo any munmap work and free resources
> + *
> + * Reattach detached vmas, free up maple tree used to track the vmas.
> + */
> +static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> +{
> + struct vm_area_struct *vma;
> + int limit;
> +
> + limit = mas_detach->index;
This feels like a change to existing behaviour actually, I mean a sensible
one - as you are not just walking the tree start-to-end but rather only
walking up to the point that it has been populated (assuming I'm not
missing anything, looks to me like mas_for_each is _inclusive_ on max).
Maybe worth mentioning in commit msg?
> + mas_set(mas_detach, 0);
> + /* Re-attach any detached VMAs */
> + mas_for_each(mas_detach, vma, limit)
> + vma_mark_detached(vma, false);
> +
> + __mt_destroy(mas_detach->tree);
> +}
> +
> /*
> * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> * @vmi: The vma iterator
> @@ -2740,11 +2759,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> userfaultfd_error:
> munmap_gather_failed:
> end_split_failed:
> - mas_set(&mas_detach, 0);
> - mas_for_each(&mas_detach, next, end)
> - vma_mark_detached(next, false);
> -
> - __mt_destroy(&mt_detach);
> + abort_munmap_vmas(&mas_detach);
> start_split_failed:
> map_count_exceeded:
> validate_mm(mm);
> --
> 2.43.0
>
This looks fine though, feel free to add:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas()
2024-07-04 18:27 ` [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
@ 2024-07-05 17:39 ` Lorenzo Stoakes
2024-07-10 16:07 ` Suren Baghdasaryan
1 sibling, 0 replies; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 17:39 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 04, 2024 at 02:27:05PM GMT, Liam R. Howlett wrote:
> Extract all necessary operations that need to be completed after the vma
> maple tree is updated from a munmap() operation. Extracting this makes
> the later patch in the series easier to understand.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/mmap.c | 79 +++++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 53 insertions(+), 26 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d572e1ff8255..411798f46932 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2605,6 +2605,56 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> __mt_destroy(mas_detach->tree);
> }
>
> +/*
> + * vmi_complete_munmap_vmas() - Finish the munmap() operation
> + * @vmi: The vma iterator
> + * @start: The start address
> + * @end: The end address
> + * @unlock: Unlock the mm or not
> + * @mas_detach: the maple state of the detached vma maple tree
> + * @locked_vm: The locked_vm count in the detached vmas
> + *
> + * This function updates the mm_struct, unmaps the region, frees the resources
> + * used for the munmap() and may downgrade the lock - if requested. Everything
> + * needed to be done once the vma maple tree is updated.
> + */
> +static void
> +vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> + struct mm_struct *mm, unsigned long start,
> + unsigned long end, bool unlock, struct ma_state *mas_detach,
> + unsigned long locked_vm)
> +{
> + struct vm_area_struct *prev, *next;
> + int count;
> +
> + count = mas_detach->index + 1;
> + mm->map_count -= count;
> + mm->locked_vm -= locked_vm;
> + if (unlock)
> + mmap_write_downgrade(mm);
> +
> + prev = vma_iter_prev_range(vmi);
> + next = vma_next(vmi);
> + if (next)
> + vma_iter_prev_range(vmi);
> +
> + /*
> + * We can free page tables without write-locking mmap_lock because VMAs
> + * were isolated before we downgraded mmap_lock.
> + */
> + mas_set(mas_detach, 1);
> + unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
> + !unlock);
> + /* Statistics and freeing VMAs */
> + mas_set(mas_detach, 0);
> + remove_mt(mm, mas_detach);
> + validate_mm(mm);
> + if (unlock)
> + mmap_read_unlock(mm);
> +
> + __mt_destroy(mas_detach->tree);
> +}
> +
> /*
> * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> * @vmi: The vma iterator
> @@ -2624,7 +2674,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct mm_struct *mm, unsigned long start,
> unsigned long end, struct list_head *uf, bool unlock)
> {
> - struct vm_area_struct *prev, *next = NULL;
> + struct vm_area_struct *next = NULL;
> struct maple_tree mt_detach;
> int count = 0;
> int error = -ENOMEM;
> @@ -2728,31 +2778,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> goto clear_tree_failed;
>
> /* Point of no return */
> - mm->locked_vm -= locked_vm;
> - mm->map_count -= count;
> - if (unlock)
> - mmap_write_downgrade(mm);
> -
> - prev = vma_iter_prev_range(vmi);
> - next = vma_next(vmi);
> - if (next)
> - vma_iter_prev_range(vmi);
> -
> - /*
> - * We can free page tables without write-locking mmap_lock because VMAs
> - * were isolated before we downgraded mmap_lock.
> - */
> - mas_set(&mas_detach, 1);
> - unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
> - !unlock);
> - /* Statistics and freeing VMAs */
> - mas_set(&mas_detach, 0);
> - remove_mt(mm, &mas_detach);
> - validate_mm(mm);
> - if (unlock)
> - mmap_read_unlock(mm);
> -
> - __mt_destroy(&mt_detach);
> + vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
> + locked_vm);
> return 0;
>
> clear_tree_failed:
> --
> 2.43.0
>
>
LGTM, feel free to add:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
2024-07-04 18:27 ` [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
@ 2024-07-05 18:01 ` Lorenzo Stoakes
2024-07-05 18:41 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
1 sibling, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 18:01 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 04, 2024 at 02:27:06PM GMT, Liam R. Howlett wrote:
> Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
> detached maple tree for removal later. Part of the gathering is the
> splitting of vmas that span the boundary.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/mmap.c | 82 +++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 58 insertions(+), 24 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 411798f46932..8dc8ffbf9d8d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2656,32 +2656,29 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> }
>
> /*
> - * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> + * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> + * for removal at a later date. Handles splitting first and last if necessary
> + * and marking the vmas as isolated.
> + *
> * @vmi: The vma iterator
> * @vma: The starting vm_area_struct
> * @mm: The mm_struct
> * @start: The aligned start address to munmap.
> * @end: The aligned end address to munmap.
> * @uf: The userfaultfd list_head
> - * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
> - * success.
> + * @mas_detach: The maple state tracking the detached tree
Missing the locked_vm parameter.
> *
> - * Return: 0 on success and drops the lock if so directed, error and leaves the
> - * lock held otherwise.
> + * Return: 0 on success
> */
> static int
> -do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> +vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct mm_struct *mm, unsigned long start,
> - unsigned long end, struct list_head *uf, bool unlock)
> + unsigned long end, struct list_head *uf,
> + struct ma_state *mas_detach, unsigned long *locked_vm)
> {
> struct vm_area_struct *next = NULL;
> - struct maple_tree mt_detach;
> int count = 0;
> int error = -ENOMEM;
> - unsigned long locked_vm = 0;
> - MA_STATE(mas_detach, &mt_detach, 0, 0);
> - mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> - mt_on_stack(mt_detach);
>
> /*
> * If we need to split any vma, do it now to save pain later.
> @@ -2720,15 +2717,14 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> goto end_split_failed;
> }
> vma_start_write(next);
> - mas_set(&mas_detach, count);
> - error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
> + mas_set(mas_detach, count++);
> + if (next->vm_flags & VM_LOCKED)
> + *locked_vm += vma_pages(next);
> +
> + error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> if (error)
> goto munmap_gather_failed;
> vma_mark_detached(next, true);
> - if (next->vm_flags & VM_LOCKED)
> - locked_vm += vma_pages(next);
> -
> - count++;
> if (unlikely(uf)) {
> /*
> * If userfaultfd_unmap_prep returns an error the vmas
> @@ -2753,7 +2749,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> /* Make sure no VMAs are about to be lost. */
> {
> - MA_STATE(test, &mt_detach, 0, 0);
> + MA_STATE(test, mas_detach->tree, 0, 0);
> struct vm_area_struct *vma_mas, *vma_test;
> int test_count = 0;
>
> @@ -2773,6 +2769,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> while (vma_iter_addr(vmi) > start)
> vma_iter_prev_range(vmi);
>
> + return 0;
> +
> +userfaultfd_error:
> +munmap_gather_failed:
> +end_split_failed:
> + abort_munmap_vmas(mas_detach);
> +start_split_failed:
> +map_count_exceeded:
> + return error;
> +}
> +
> +/*
> + * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> + * @vmi: The vma iterator
> + * @vma: The starting vm_area_struct
> + * @mm: The mm_struct
> + * @start: The aligned start address to munmap.
> + * @end: The aligned end address to munmap.
> + * @uf: The userfaultfd list_head
> + * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
> + * success.
> + *
> + * Return: 0 on success and drops the lock if so directed, error and leaves the
> + * lock held otherwise.
> + */
> +static int
> +do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> + struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct list_head *uf, bool unlock)
> +{
> + struct maple_tree mt_detach;
> + MA_STATE(mas_detach, &mt_detach, 0, 0);
> + mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> + mt_on_stack(mt_detach);
> + int error;
> + unsigned long locked_vm = 0;
> +
> + error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
> + &mas_detach, &locked_vm);
> + if (error)
> + goto gather_failed;
> +
> error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> if (error)
> goto clear_tree_failed;
> @@ -2783,12 +2821,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> return 0;
>
> clear_tree_failed:
> -userfaultfd_error:
> -munmap_gather_failed:
> -end_split_failed:
> abort_munmap_vmas(&mas_detach);
> -start_split_failed:
> -map_count_exceeded:
> +gather_failed:
> validate_mm(mm);
> return error;
> }
> --
> 2.43.0
>
>
Other than trivial comment error, LGTM:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas()
2024-07-05 17:02 ` Lorenzo Stoakes
@ 2024-07-05 18:12 ` Liam R. Howlett
2024-07-10 16:06 ` Suren Baghdasaryan
0 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-05 18:12 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 13:02]:
> On Thu, Jul 04, 2024 at 02:27:04PM GMT, Liam R. Howlett wrote:
> > Extract clean up of failed munmap() operations from
> > do_vmi_align_munmap(). This simplifies later patches in the series.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > ---
> > mm/mmap.c | 25 ++++++++++++++++++++-----
> > 1 file changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 28a46d9ddde0..d572e1ff8255 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2586,6 +2586,25 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
> > vma->vm_userfaultfd_ctx, anon_vma_name(vma));
> > }
> >
> > +/*
> > + * abort_munmap_vmas - Undo any munmap work and free resources
> > + *
> > + * Reattach detached vmas, free up maple tree used to track the vmas.
> > + */
> > +static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> > +{
> > + struct vm_area_struct *vma;
> > + int limit;
> > +
> > + limit = mas_detach->index;
>
> This feels like a change to existing behaviour actually, I mean a sensible
> one - as you are not just walking the tree start-to-end but rather only
> walking up to the point that it has been populated (assuming I'm not
> missing anything, looks to me like mas_for_each is _inclusive_ on max).
This is not the main tree, but the detached tree. It only contains the
vmas that are going to be freed (or, rather aborted from being freed).
I see what you mean that the end in the abort code below would be one
beyond the tree walk. The new abort code uses the index (from the
previous write) as the limit.
All that really matters is that we go to a number high enough to cover
all vmas that were detached. I used 'end' in the below code because I
knew it would cover all of the vmas added (we actually start at index
0).
The value of 'mas_detach->index' is used in the new code because I knew
that's as far as I had to go, and I could limit the arguments passed
to the function.
I think that I'll actually change limit to ULONG_MAX in another revision
because I like that better than expecting the index to have not been
touched by others.
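Untested sketch of what that would look like:

    static inline void abort_munmap_vmas(struct ma_state *mas_detach)
    {
            struct vm_area_struct *vma;

            mas_set(mas_detach, 0);
            /* Re-attach any detached VMAs */
            mas_for_each(mas_detach, vma, ULONG_MAX)
                    vma_mark_detached(vma, false);

            __mt_destroy(mas_detach->tree);
    }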
>
> Maybe worth mentioning in commit msg?
Yes, good idea. Thanks for catching this.
>
> > + mas_set(mas_detach, 0);
> > + /* Re-attach any detached VMAs */
> > + mas_for_each(mas_detach, vma, limit)
> > + vma_mark_detached(vma, false);
> > +
> > + __mt_destroy(mas_detach->tree);
> > +}
> > +
> > /*
> > * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> > * @vmi: The vma iterator
> > @@ -2740,11 +2759,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > userfaultfd_error:
> > munmap_gather_failed:
> > end_split_failed:
> > - mas_set(&mas_detach, 0);
> > - mas_for_each(&mas_detach, next, end)
> > - vma_mark_detached(next, false);
> > -
> > - __mt_destroy(&mt_detach);
> > + abort_munmap_vmas(&mas_detach);
> > start_split_failed:
> > map_count_exceeded:
> > validate_mm(mm);
> > --
> > 2.43.0
> >
>
> This looks fine though, feel free to add:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations
2024-07-04 18:27 ` [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
@ 2024-07-05 18:39 ` Lorenzo Stoakes
2024-07-05 19:09 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 18:39 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:07PM GMT, Liam R. Howlett wrote:
> Use a structure to pass along all the necessary information and counters
> involved in removing vmas from the mm_struct.
>
> Update vmi_ function names to vms_ to indicate the first argument
> type change.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/internal.h | 16 ++++++
> mm/mmap.c | 137 ++++++++++++++++++++++++++------------------------
> 2 files changed, 88 insertions(+), 65 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 2ea9a88dcb95..f1e6dea2efcf 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1481,6 +1481,22 @@ struct vma_prepare {
> struct vm_area_struct *remove2;
> };
>
> +/*
> + * vma munmap operation
> + */
> +struct vma_munmap_struct {
> + struct vma_iterator *vmi;
> + struct mm_struct *mm;
> + struct vm_area_struct *vma; /* The first vma to munmap */
> + struct list_head *uf; /* Userfaultfd list_head */
> + unsigned long start; /* Aligned start addr */
> + unsigned long end; /* Aligned end addr */
> + int vma_count; /* Number of vmas that will be removed */
> + unsigned long nr_pages; /* Number of pages being removed */
> + unsigned long locked_vm; /* Number of locked pages */
> + bool unlock; /* Unlock after the munmap */
> +};
I'm a big fan of breaking out and threading state like this through some of
these more... verbose VMA functions.
I have a similar idea as part of my long dreamed of 'delete vma_merge()'
patch set. Coming soon :)
> +
> void __meminit __init_single_page(struct page *page, unsigned long pfn,
> unsigned long zone, int nid);
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8dc8ffbf9d8d..76e93146ee9d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -500,6 +500,31 @@ static inline void init_vma_prep(struct vma_prepare *vp,
> init_multi_vma_prep(vp, vma, NULL, NULL, NULL);
> }
>
> +/*
> + * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> + * @vms: The vma munmap struct
> + * @vmi: The vma iterator
> + * @vma: The first vm_area_struct to munmap
> + * @start: The aligned start address to munmap
> + * @end: The aligned end address to munmap
Maybe worth mentioning if inclusive/exclusive.
> + * @uf: The userfaultfd list_head
> + * @unlock: Unlock after the operation. Only unlocked on success
> + */
> +static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> + struct vma_iterator *vmi, struct vm_area_struct *vma,
> + unsigned long start, unsigned long end, struct list_head *uf,
> + bool unlock)
> +{
> + vms->vmi = vmi;
> + vms->vma = vma;
> + vms->mm = vma->vm_mm;
> + vms->start = start;
> + vms->end = end;
> + vms->unlock = unlock;
> + vms->uf = uf;
> + vms->vma_count = 0;
> + vms->nr_pages = vms->locked_vm = 0;
> +}
>
> /*
> * vma_prepare() - Helper function for handling locking VMAs prior to altering
> @@ -2606,78 +2631,59 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> }
>
> /*
> - * vmi_complete_munmap_vmas() - Finish the munmap() operation
> - * @vmi: The vma iterator
> - * @start: The start address
> - * @end: The end address
> - * @unlock: Unlock the mm or not
> - * @mas_detach: them maple state of the detached vma maple tree
> - * @locked_vm: The locked_vm count in the detached vmas
> - *
> - * This function updates the mm_struct, unmaps the region, frees the resources
> - * used for the munmap() and may downgrade the lock - if requested. Everything
> - * needed to be done once the vma maple tree is updated.
> + * vms_complete_munmap_vmas() - Finish the munmap() operation
> + * @vms: The vma munmap struct
> + * @mas_detach: The maple state of the detached vmas
> */
> -static void
> -vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> - struct mm_struct *mm, unsigned long start,
> - unsigned long end, bool unlock, struct ma_state *mas_detach,
> - unsigned long locked_vm)
> +
> +static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> + struct ma_state *mas_detach)
> {
> struct vm_area_struct *prev, *next;
> - int count;
> + struct mm_struct *mm;
>
> - count = mas_detach->index + 1;
> - mm->map_count -= count;
> - mm->locked_vm -= locked_vm;
> - if (unlock)
> + mm = vms->mm;
> + mm->map_count -= vms->vma_count;
> + mm->locked_vm -= vms->locked_vm;
> + if (vms->unlock)
> mmap_write_downgrade(mm);
>
> - prev = vma_iter_prev_range(vmi);
> - next = vma_next(vmi);
> + prev = vma_iter_prev_range(vms->vmi);
> + next = vma_next(vms->vmi);
> if (next)
> - vma_iter_prev_range(vmi);
> + vma_iter_prev_range(vms->vmi);
>
> /*
> * We can free page tables without write-locking mmap_lock because VMAs
> * were isolated before we downgraded mmap_lock.
> */
> mas_set(mas_detach, 1);
> - unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
> - !unlock);
> + unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
> + vms->vma_count, !vms->unlock);
> /* Statistics and freeing VMAs */
> mas_set(mas_detach, 0);
> remove_mt(mm, mas_detach);
> validate_mm(mm);
> - if (unlock)
> + if (vms->unlock)
> mmap_read_unlock(mm);
>
> __mt_destroy(mas_detach->tree);
> }
>
> /*
> - * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> + * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> * for removal at a later date. Handles splitting first and last if necessary
> * and marking the vmas as isolated.
> *
> - * @vmi: The vma iterator
> - * @vma: The starting vm_area_struct
> - * @mm: The mm_struct
> - * @start: The aligned start address to munmap.
> - * @end: The aligned end address to munmap.
> - * @uf: The userfaultfd list_head
> + * @vms: The vma munmap struct
> * @mas_detach: The maple state tracking the detached tree
> *
> * Return: 0 on success
> */
> -static int
> -vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> - struct mm_struct *mm, unsigned long start,
> - unsigned long end, struct list_head *uf,
> - struct ma_state *mas_detach, unsigned long *locked_vm)
> +static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> + struct ma_state *mas_detach)
> {
This is actually glorious.
> struct vm_area_struct *next = NULL;
> - int count = 0;
> int error = -ENOMEM;
>
> /*
> @@ -2689,17 +2695,18 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> */
>
> /* Does it split the first one? */
> - if (start > vma->vm_start) {
> + if (vms->start > vms->vma->vm_start) {
>
> /*
> * Make sure that map_count on return from munmap() will
> * not exceed its limit; but let map_count go just above
> * its limit temporarily, to help free resources as expected.
> */
> - if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
> + if (vms->end < vms->vma->vm_end &&
> + vms->mm->map_count >= sysctl_max_map_count)
> goto map_count_exceeded;
>
> - error = __split_vma(vmi, vma, start, 1);
> + error = __split_vma(vms->vmi, vms->vma, vms->start, 1);
> if (error)
> goto start_split_failed;
> }
> @@ -2708,24 +2715,24 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> * Detach a range of VMAs from the mm. Using next as a temp variable as
> * it is always overwritten.
> */
> - next = vma;
> + next = vms->vma;
> do {
> /* Does it split the end? */
> - if (next->vm_end > end) {
> - error = __split_vma(vmi, next, end, 0);
> + if (next->vm_end > vms->end) {
> + error = __split_vma(vms->vmi, next, vms->end, 0);
> if (error)
> goto end_split_failed;
> }
> vma_start_write(next);
> - mas_set(mas_detach, count++);
> + mas_set(mas_detach, vms->vma_count++);
> if (next->vm_flags & VM_LOCKED)
> - *locked_vm += vma_pages(next);
> + vms->locked_vm += vma_pages(next);
>
> error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> if (error)
> goto munmap_gather_failed;
> vma_mark_detached(next, true);
> - if (unlikely(uf)) {
> + if (unlikely(vms->uf)) {
> /*
> * If userfaultfd_unmap_prep returns an error the vmas
> * will remain split, but userland will get a
> @@ -2735,16 +2742,17 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> * split, despite we could. This is unlikely enough
> * failure that it's not worth optimizing it for.
> */
> - error = userfaultfd_unmap_prep(next, start, end, uf);
> + error = userfaultfd_unmap_prep(next, vms->start,
> + vms->end, vms->uf);
>
> if (error)
> goto userfaultfd_error;
> }
> #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
> - BUG_ON(next->vm_start < start);
> - BUG_ON(next->vm_start > end);
> + BUG_ON(next->vm_start < vms->start);
> + BUG_ON(next->vm_start > vms->end);
> #endif
> - } for_each_vma_range(*vmi, next, end);
> + } for_each_vma_range(*(vms->vmi), next, vms->end);
>
> #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> /* Make sure no VMAs are about to be lost. */
> @@ -2753,21 +2761,21 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct vm_area_struct *vma_mas, *vma_test;
> int test_count = 0;
>
> - vma_iter_set(vmi, start);
> + vma_iter_set(vms->vmi, vms->start);
> rcu_read_lock();
> - vma_test = mas_find(&test, count - 1);
> - for_each_vma_range(*vmi, vma_mas, end) {
> + vma_test = mas_find(&test, vms->vma_count - 1);
> + for_each_vma_range(*(vms->vmi), vma_mas, vms->end) {
> BUG_ON(vma_mas != vma_test);
> test_count++;
> - vma_test = mas_next(&test, count - 1);
> + vma_test = mas_next(&test, vms->vma_count - 1);
> }
> rcu_read_unlock();
> - BUG_ON(count != test_count);
> + BUG_ON(vms->vma_count != test_count);
> }
> #endif
>
> - while (vma_iter_addr(vmi) > start)
> - vma_iter_prev_range(vmi);
> + while (vma_iter_addr(vms->vmi) > vms->start)
> + vma_iter_prev_range(vms->vmi);
>
> return 0;
>
> @@ -2803,11 +2811,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> MA_STATE(mas_detach, &mt_detach, 0, 0);
> mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> mt_on_stack(mt_detach);
> + struct vma_munmap_struct vms;
> int error;
> - unsigned long locked_vm = 0;
>
> - error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
> - &mas_detach, &locked_vm);
> + init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
> + error = vms_gather_munmap_vmas(&vms, &mas_detach);
> if (error)
> goto gather_failed;
>
> @@ -2816,8 +2824,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> goto clear_tree_failed;
>
> /* Point of no return */
> - vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
> - locked_vm);
> + vms_complete_munmap_vmas(&vms, &mas_detach);
> return 0;
>
> clear_tree_failed:
> --
> 2.43.0
>
Very nice refactoring.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
2024-07-05 18:01 ` Lorenzo Stoakes
@ 2024-07-05 18:41 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-05 18:41 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 14:01]:
> On Thu, Jul 04, 2024 at 02:27:06PM GMT, Liam R. Howlett wrote:
> > Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
> > detached maple tree for removal later. Part of the gathering is the
> > splitting of vmas that span the boundary.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > ---
> > mm/mmap.c | 82 +++++++++++++++++++++++++++++++++++++++----------------
> > 1 file changed, 58 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 411798f46932..8dc8ffbf9d8d 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2656,32 +2656,29 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > }
> >
> > /*
> > - * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> > + * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> > + * for removal at a later date. Handles splitting first and last if necessary
> > + * and marking the vmas as isolated.
> > + *
> > * @vmi: The vma iterator
> > * @vma: The starting vm_area_struct
> > * @mm: The mm_struct
> > * @start: The aligned start address to munmap.
> > * @end: The aligned end address to munmap.
> > * @uf: The userfaultfd list_head
> > - * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
> > - * success.
> > + * @mas_detach: The maple state tracking the detached tree
>
> Missing the locked_vm parameter.
Thanks. This will be dropped later but I'll add it and drop it later.
>
> > *
> > - * Return: 0 on success and drops the lock if so directed, error and leaves the
> > - * lock held otherwise.
> > + * Return: 0 on success
> > */
> > static int
> > -do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > +vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > struct mm_struct *mm, unsigned long start,
> > - unsigned long end, struct list_head *uf, bool unlock)
> > + unsigned long end, struct list_head *uf,
> > + struct ma_state *mas_detach, unsigned long *locked_vm)
...
>
> Other than trivial comment error, LGTM:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations
2024-07-05 18:39 ` Lorenzo Stoakes
@ 2024-07-05 19:09 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
0 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-05 19:09 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 14:39]:
> On Thu, Jul 04, 2024 at 02:27:07PM GMT, Liam R. Howlett wrote:
> > Use a structure to pass along all the necessary information and counters
> > involved in removing vmas from the mm_struct.
> >
> > Update vmi_ function names to vms_ to indicate the first argument
> > type change.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > mm/internal.h | 16 ++++++
> > mm/mmap.c | 137 ++++++++++++++++++++++++++------------------------
> > 2 files changed, 88 insertions(+), 65 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 2ea9a88dcb95..f1e6dea2efcf 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1481,6 +1481,22 @@ struct vma_prepare {
> > struct vm_area_struct *remove2;
> > };
> >
> > +/*
> > + * vma munmap operation
> > + */
> > +struct vma_munmap_struct {
> > + struct vma_iterator *vmi;
> > + struct mm_struct *mm;
> > + struct vm_area_struct *vma; /* The first vma to munmap */
> > + struct list_head *uf; /* Userfaultfd list_head */
> > + unsigned long start; /* Aligned start addr */
> > + unsigned long end; /* Aligned end addr */
> > + int vma_count; /* Number of vmas that will be removed */
> > + unsigned long nr_pages; /* Number of pages being removed */
> > + unsigned long locked_vm; /* Number of locked pages */
> > + bool unlock; /* Unlock after the munmap */
> > +};
>
>
> I'm a big fan of breaking out and threading state like this through some of
> these more... verbose VMA functions.
>
> I have a similar idea as part of my long dreamed of 'delete vma_merge()'
> patch set. Coming soon :)
>
> > +
> > void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > unsigned long zone, int nid);
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 8dc8ffbf9d8d..76e93146ee9d 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -500,6 +500,31 @@ static inline void init_vma_prep(struct vma_prepare *vp,
> > init_multi_vma_prep(vp, vma, NULL, NULL, NULL);
> > }
> >
> > +/*
> > + * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> > + * @vms: The vma munmap struct
> > + * @vmi: The vma iterator
> > + * @vma: The first vm_area_struct to munmap
> > + * @start: The aligned start address to munmap
> > + * @end: The aligned end address to munmap
>
> Maybe worth mentioning if inclusive/exclusive.
The "address to munmap" isn't specific enough that we are using the same
logic as the munmap call? That is, the vma inclusive and exclusive for
start and end, respectively.
Not a big change, either way.
...
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-07-04 18:27 ` [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
@ 2024-07-05 19:27 ` Lorenzo Stoakes
2024-07-05 19:59 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 19:27 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:08PM GMT, Liam R. Howlett wrote:
> Clean up the code by changing the munmap operation to use a structure
> for the accounting and munmap variables.
>
> Since remove_mt() is only called in one location and its contents will
> be reduced to almost nothing, the remains of the function can be added
> to vms_complete_munmap_vmas().
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/internal.h | 6 ++++
> mm/mmap.c | 81 ++++++++++++++++++++++++++-------------------------
> 2 files changed, 47 insertions(+), 40 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index f1e6dea2efcf..8cbbbe7d40f3 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1488,12 +1488,18 @@ struct vma_munmap_struct {
> struct vma_iterator *vmi;
> struct mm_struct *mm;
> struct vm_area_struct *vma; /* The first vma to munmap */
> + struct vm_area_struct *next; /* vma after the munmap area */
> + struct vm_area_struct *prev; /* vma before the munmap area */
I mean this is about as pedantic as it gets, and, admittedly an annoying
comment to make, but the ordering... can't we at least put prev before
next? ;)
This is actually a comment you can fully disregard, by the way!
> struct list_head *uf; /* Userfaultfd list_head */
> unsigned long start; /* Aligned start addr */
> unsigned long end; /* Aligned end addr */
> int vma_count; /* Number of vmas that will be removed */
> unsigned long nr_pages; /* Number of pages being removed */
> unsigned long locked_vm; /* Number of locked pages */
> + unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
> + unsigned long exec_vm;
> + unsigned long stack_vm;
> + unsigned long data_vm;
> bool unlock; /* Unlock after the munmap */
> };
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 76e93146ee9d..2a1a49f98fa3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -523,7 +523,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> vms->unlock = unlock;
> vms->uf = uf;
> vms->vma_count = 0;
> - vms->nr_pages = vms->locked_vm = 0;
> + vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
> + vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> }
>
> /*
> @@ -2388,30 +2389,6 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
> return vma;
> }
>
> -/*
> - * Ok - we have the memory areas we should free on a maple tree so release them,
> - * and do the vma updates.
> - *
> - * Called with the mm semaphore held.
> - */
> -static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
> -{
> - unsigned long nr_accounted = 0;
> - struct vm_area_struct *vma;
> -
> - /* Update high watermark before we lower total_vm */
> - update_hiwater_vm(mm);
> - mas_for_each(mas, vma, ULONG_MAX) {
> - long nrpages = vma_pages(vma);
> -
> - if (vma->vm_flags & VM_ACCOUNT)
> - nr_accounted += nrpages;
> - vm_stat_account(mm, vma->vm_flags, -nrpages);
> - remove_vma(vma, false);
> - }
> - vm_unacct_memory(nr_accounted);
> -}
> -
> /*
> * Get rid of page table information in the indicated region.
> *
> @@ -2634,12 +2611,15 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> * vms_complete_munmap_vmas() - Finish the munmap() operation
> * @vms: The vma munmap struct
> * @mas_detach: The maple state of the detached vmas
> + *
> + * This function updates the mm_struct, unmaps the region, frees the resources
> + * used for the munmap() and may downgrade the lock - if requested. Everything
> + * needed to be done once the vma maple tree is updated.
> */
> -
> static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> struct ma_state *mas_detach)
> {
> - struct vm_area_struct *prev, *next;
> + struct vm_area_struct *vma;
> struct mm_struct *mm;
>
> mm = vms->mm;
> @@ -2648,21 +2628,26 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> if (vms->unlock)
> mmap_write_downgrade(mm);
>
> - prev = vma_iter_prev_range(vms->vmi);
> - next = vma_next(vms->vmi);
> - if (next)
> - vma_iter_prev_range(vms->vmi);
> -
> /*
> * We can free page tables without write-locking mmap_lock because VMAs
> * were isolated before we downgraded mmap_lock.
> */
> mas_set(mas_detach, 1);
> - unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
> - vms->vma_count, !vms->unlock);
> - /* Statistics and freeing VMAs */
> + unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
> + vms->start, vms->end, vms->vma_count, !vms->unlock);
> + /* Update high watermark before we lower total_vm */
> + update_hiwater_vm(mm);
> + /* Stat accounting */
> + WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
> + mm->exec_vm -= vms->exec_vm;
> + mm->stack_vm -= vms->stack_vm;
> + mm->data_vm -= vms->data_vm;
OK, I was going to ask why you're READ_ONCE()'ing and WRITE_ONCE()'ing
mm->total_vm but not the other statistics, but the original
vm_stat_account() does this too, so. Yeah :)
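For reference, vm_stat_account() is roughly:

    void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
    {
            WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) + npages);

            if (is_exec_mapping(flags))
                    mm->exec_vm += npages;
            else if (is_stack_mapping(flags))
                    mm->stack_vm += npages;
            else if (is_data_mapping(flags))
                    mm->data_vm += npages;
    }

so only total_vm gets the READ_ONCE()/WRITE_ONCE() treatment there as well.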
> + /* Remove and clean up vmas */
> mas_set(mas_detach, 0);
> - remove_mt(mm, mas_detach);
> + mas_for_each(mas_detach, vma, ULONG_MAX)
> + remove_vma(vma, false);
> +
> + vm_unacct_memory(vms->nr_accounted);
> validate_mm(mm);
> if (vms->unlock)
> mmap_read_unlock(mm);
> @@ -2710,13 +2695,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> if (error)
> goto start_split_failed;
> }
> + vms->prev = vma_prev(vms->vmi);
>
> /*
> * Detach a range of VMAs from the mm. Using next as a temp variable as
> * it is always overwritten.
> */
> - next = vms->vma;
> - do {
> + for_each_vma_range(*(vms->vmi), next, vms->end) {
> + long nrpages;
> /* Does it split the end? */
> if (next->vm_end > vms->end) {
> error = __split_vma(vms->vmi, next, vms->end, 0);
> @@ -2725,8 +2711,21 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> }
> vma_start_write(next);
> mas_set(mas_detach, vms->vma_count++);
> + nrpages = vma_pages(next);
> +
> + vms->nr_pages += nrpages;
> if (next->vm_flags & VM_LOCKED)
> - vms->locked_vm += vma_pages(next);
> + vms->locked_vm += nrpages;
> +
> + if (next->vm_flags & VM_ACCOUNT)
> + vms->nr_accounted += nrpages;
> +
> + if (is_exec_mapping(next->vm_flags))
> + vms->exec_vm += nrpages;
> + else if (is_stack_mapping(next->vm_flags))
> + vms->stack_vm += nrpages;
> + else if (is_data_mapping(next->vm_flags))
> + vms->data_vm += nrpages;
>
> error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> if (error)
> @@ -2752,7 +2751,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> BUG_ON(next->vm_start < vms->start);
> BUG_ON(next->vm_start > vms->end);
> #endif
> - } for_each_vma_range(*(vms->vmi), next, vms->end);
> + }
> +
> + vms->next = vma_next(vms->vmi);
>
> #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> /* Make sure no VMAs are about to be lost. */
> --
> 2.43.0
>
This is a big improvement overall, very fiddly code.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 07/16] mm/mmap: Extract validate_mm() from vma_complete()
2024-07-04 18:27 ` [PATCH v3 07/16] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
@ 2024-07-05 19:35 ` Lorenzo Stoakes
2024-07-10 16:06 ` Suren Baghdasaryan
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 19:35 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 04, 2024 at 02:27:09PM GMT, Liam R. Howlett wrote:
> vma_complete() will need to be called during an unsafe time to call
> validate_mm(). Extract the call in all places now so that only one
> location can be modified in the next change.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/mmap.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2a1a49f98fa3..8d9be791997a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -646,7 +646,6 @@ static inline void vma_complete(struct vma_prepare *vp,
> }
> if (vp->insert && vp->file)
> uprobe_mmap(vp->insert);
> - validate_mm(mm);
> }
>
> /*
> @@ -734,6 +733,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> vma_iter_store(vmi, vma);
>
> vma_complete(&vp, vmi, vma->vm_mm);
> + validate_mm(vma->vm_mm);
> return 0;
>
> nomem:
> @@ -775,6 +775,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> vma_iter_clear(vmi);
> vma_set_range(vma, start, end, pgoff);
> vma_complete(&vp, vmi, vma->vm_mm);
> + validate_mm(vma->vm_mm);
> return 0;
> }
>
> @@ -1103,6 +1104,7 @@ static struct vm_area_struct
> }
>
> vma_complete(&vp, vmi, mm);
> + validate_mm(mm);
> khugepaged_enter_vma(res, vm_flags);
> return res;
>
> @@ -2481,6 +2483,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>
> /* vma_complete stores the new vma */
> vma_complete(&vp, vmi, vma->vm_mm);
> + validate_mm(vma->vm_mm);
>
> /* Success. */
> if (new_below)
> @@ -3353,6 +3356,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> vma_iter_store(vmi, vma);
>
> vma_complete(&vp, vmi, mm);
> + validate_mm(mm);
> khugepaged_enter_vma(vma, flags);
> goto out;
> }
> --
> 2.43.0
>
>
LGTM
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region()
2024-07-04 18:27 ` [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
@ 2024-07-05 19:39 ` Lorenzo Stoakes
2024-07-05 20:00 ` Liam R. Howlett
2024-07-10 16:15 ` Suren Baghdasaryan
1 sibling, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 19:39 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:10PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> mmap_region is already passed sanitized addr and len, so change the
> call to do_vmi_munmap() to do_vmi_align_munmap() and inline the other
> checks.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8d9be791997a..e9858ca8bbd4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2937,12 +2937,20 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> return -ENOMEM;
> }
>
> - /* Unmap any existing mapping in the area */
This feels like an important comment. I know 'find the first overlapping
VMA' below alludes to it, but whenever I've read through this function this
comment has actually been quite useful for knowing what the intent is.
Could we reinstate it?
> - error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
> - if (error == -EPERM)
> - return error;
> - else if (error)
> - return -ENOMEM;
> +
> + if (unlikely(!can_modify_mm(mm, addr, end)))
> + return -EPERM;
> +
> + /* arch_unmap() might do unmaps itself. */
> + arch_unmap(mm, addr, end);
> +
> + /* Find the first overlapping VMA */
> + vma = vma_find(&vmi, end);
> + if (vma) {
> + if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> + return -ENOMEM;
> + vma = NULL;
> + }
>
> /*
> * Private writable mapping: check memory availability
> --
> 2.43.0
>
>
Other than trivial point above, LGTM.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-07-05 19:27 ` Lorenzo Stoakes
@ 2024-07-05 19:59 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
0 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-05 19:59 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 15:27]:
> On Thu, Jul 04, 2024 at 02:27:08PM GMT, Liam R. Howlett wrote:
> > Clean up the code by changing the munmap operation to use a structure
> > for the accounting and munmap variables.
> >
> > Since remove_mt() is only called in one location and its contents will
> > be reduced to almost nothing, the remains of the function can be added
> > to vms_complete_munmap_vmas().
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > ---
> > mm/internal.h | 6 ++++
> > mm/mmap.c | 81 ++++++++++++++++++++++++++-------------------------
> > 2 files changed, 47 insertions(+), 40 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index f1e6dea2efcf..8cbbbe7d40f3 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1488,12 +1488,18 @@ struct vma_munmap_struct {
> > struct vma_iterator *vmi;
> > struct mm_struct *mm;
> > struct vm_area_struct *vma; /* The first vma to munmap */
> > + struct vm_area_struct *next; /* vma after the munmap area */
> > + struct vm_area_struct *prev; /* vma before the munmap area */
>
> I mean this is about as pedantic as it gets, and, admittedly an annoying
> comment to make, but the ordering... can't we at least put prev before
> next? ;)
I can do that, no problem.
...
>
> This is a big improvement overall, very fiddly code.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region()
2024-07-05 19:39 ` Lorenzo Stoakes
@ 2024-07-05 20:00 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-05 20:00 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 15:40]:
> On Thu, Jul 04, 2024 at 02:27:10PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > mmap_region is already passed sanitized addr and len, so change the
> > call to do_vmi_munmap() to do_vmi_align_munmap() and inline the other
> > checks.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 20 ++++++++++++++------
> > 1 file changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 8d9be791997a..e9858ca8bbd4 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2937,12 +2937,20 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > return -ENOMEM;
> > }
> >
> > - /* Unmap any existing mapping in the area */
>
> This feels like an important comment, I know 'find the first overlapping
> VMA' below alludes to it but whenever I've read through this function this
> comment has actually been quite useful to know what the intent is.
>
> Could we reinstate it?
Ah, sure. I am going to have to move it around and it sort of loses its
meaning a bit once we gather and complete it (and not actually zero the
tree).. I'll try to keep it with the correct code block.
>
> > - error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
> > - if (error == -EPERM)
> > - return error;
> > - else if (error)
> > - return -ENOMEM;
> > +
> > + if (unlikely(!can_modify_mm(mm, addr, end)))
> > + return -EPERM;
> > +
> > + /* arch_unmap() might do unmaps itself. */
> > + arch_unmap(mm, addr, end);
> > +
> > + /* Find the first overlapping VMA */
> > + vma = vma_find(&vmi, end);
> > + if (vma) {
> > + if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> > + return -ENOMEM;
> > + vma = NULL;
> > + }
> >
> > /*
> > * Private writable mapping: check memory availability
> > --
> > 2.43.0
> >
> >
>
> Other than trivial point above, LGTM.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call
2024-07-04 18:27 ` [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
@ 2024-07-05 20:06 ` Lorenzo Stoakes
2024-07-05 20:30 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 20:06 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 04, 2024 at 02:27:11PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Open code the do_vmi_align_munmap() call so that it can be broken up
> later in the series.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index e9858ca8bbd4..f5b33de4e717 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2915,6 +2915,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> struct vm_area_struct *next, *prev, *merge;
> pgoff_t pglen = len >> PAGE_SHIFT;
> unsigned long charged = 0;
> + struct vma_munmap_struct vms;
> + struct ma_state mas_detach;
> + struct maple_tree mt_detach;
> unsigned long end = addr + len;
> unsigned long merge_start = addr, merge_end = end;
> bool writable_file_mapping = false;
> @@ -2947,9 +2950,24 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> /* Find the first overlapping VMA */
> vma = vma_find(&vmi, end);
> if (vma) {
> - if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> + mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> + mt_on_stack(mt_detach);
> + mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
I'm guessing this is exactly equivalent (or equivalent enough for our purposes
here) to the MA_STATE() call in do_vmi_align_munmap()?
Checking the two against each other it seems that it is indeed.
> + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> + if (vms_gather_munmap_vmas(&vms, &mas_detach))
> + return -ENOMEM;
In do_vmi_align_munmap() we also invoke validate_mm(), why aren't we doing that
here?
> +
> + if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> return -ENOMEM;
Same here.
> +
> + vms_complete_munmap_vmas(&vms, &mas_detach);
> + next = vms.next;
> + prev = vms.prev;
> + vma_prev(&vmi);
I'm sure this is correct, but just to double-check - we want to set the VMI to
prev here right?
It might be worth adding a small comment saying '/* vmi now points at prev */'
or similar, I've found it can get quite hard to follow where the iterator is at
sometimes.
> vma = NULL;
> + } else {
> + next = vma_next(&vmi);
> + prev = vma_prev(&vmi);
But here we move forward to the next VMA and set this to next, then go back to
the _original_ one and this is prev?
Actually I guess if vma == NULL, next gets you to the next, and prev jumps back
to prev, with nothing between, and so that makes sense.
> }
>
> /*
> @@ -2962,8 +2980,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vm_flags |= VM_ACCOUNT;
> }
>
> - next = vma_next(&vmi);
> - prev = vma_prev(&vmi);
> if (vm_flags & VM_SPECIAL) {
> if (prev)
> vma_iter_next_range(&vmi);
> --
> 2.43.0
>
>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region()
2024-07-04 18:27 ` [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
@ 2024-07-05 20:18 ` Lorenzo Stoakes
2024-07-05 20:56 ` Liam R. Howlett
2024-07-10 16:48 ` Suren Baghdasaryan
1 sibling, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 20:18 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:12PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Instead of moving (or leaving) the vma iterator pointing at the previous
> vma, leave it pointing at the insert location. Pointing the vma
> iterator at the insert location allows for a cleaner walk of the vma
> tree for MAP_FIXED and the no expansion cases.
I mean, it's funny I literally just asked why it was being left pointing at
prev and I guess this answers that basically :)
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 28 ++++++++++++++--------------
> 1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f5b33de4e717..ecf55d32e804 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2963,11 +2963,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vms_complete_munmap_vmas(&vms, &mas_detach);
> next = vms.next;
> prev = vms.prev;
> - vma_prev(&vmi);
> vma = NULL;
> } else {
> next = vma_next(&vmi);
> prev = vma_prev(&vmi);
> + if (prev)
> + vma_iter_next_range(&vmi);
> }
>
> /*
> @@ -2980,11 +2981,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vm_flags |= VM_ACCOUNT;
> }
>
> - if (vm_flags & VM_SPECIAL) {
> - if (prev)
> - vma_iter_next_range(&vmi);
> + if (vm_flags & VM_SPECIAL)
> goto cannot_expand;
> - }
>
> /* Attempt to expand an old mapping */
> /* Check next */
> @@ -3005,19 +3003,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> merge_start = prev->vm_start;
> vma = prev;
> vm_pgoff = prev->vm_pgoff;
> - } else if (prev) {
> - vma_iter_next_range(&vmi);
> + vma_prev(&vmi);
Why not vma_iter_prev_range()?
> }
>
> - /* Actually expand, if possible */
> - if (vma &&
> - !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> - khugepaged_enter_vma(vma, vm_flags);
> - goto expanded;
> + if (vma) {
> + /* Actually expand, if possible */
> + if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> + khugepaged_enter_vma(vma, vm_flags);
> + goto expanded;
> + }
> +
> + /* If the expand fails, then reposition the vma iterator */
> + if (unlikely(vma == prev))
> + vma_iter_set(&vmi, addr);
> }
>
> - if (vma == prev)
> - vma_iter_set(&vmi, addr);
> cannot_expand:
>
> /*
> --
> 2.43.0
>
Looks good to me,
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct
2024-07-04 18:27 ` [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
@ 2024-07-05 20:27 ` Lorenzo Stoakes
2024-07-08 14:45 ` Liam R. Howlett
2024-07-10 17:14 ` Suren Baghdasaryan
0 siblings, 2 replies; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 20:27 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:13PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Set the start and end address for munmap when the prev and next are
> gathered. This is needed to avoid incorrect addresses being used during
> the vms_complete_munmap_vmas() function if the prev/next vma are
> expanded.
When we spoke about this separately you mentioned that specific arches may
be more likely to encounter this issue, perhaps worth mentioning something
about that in the commit msg? Unless I misunderstood you.
>
> Add a new helper vms_complete_pte_clear(), which is needed later and
> will avoid growing the argument list to unmap_region() beyond the 9 it
> already has.
My word.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/internal.h | 2 ++
> mm/mmap.c | 34 +++++++++++++++++++++++++++-------
> 2 files changed, 29 insertions(+), 7 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 8cbbbe7d40f3..4c9f06669cc4 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1493,6 +1493,8 @@ struct vma_munmap_struct {
> struct list_head *uf; /* Userfaultfd list_head */
> unsigned long start; /* Aligned start addr */
> unsigned long end; /* Aligned end addr */
> + unsigned long unmap_start;
> + unsigned long unmap_end;
> int vma_count; /* Number of vmas that will be removed */
> unsigned long nr_pages; /* Number of pages being removed */
> unsigned long locked_vm; /* Number of locked pages */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index ecf55d32e804..45443a53be76 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -525,6 +525,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> vms->vma_count = 0;
> vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
> vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> + vms->unmap_start = FIRST_USER_ADDRESS;
> + vms->unmap_end = USER_PGTABLES_CEILING;
> }
>
> /*
> @@ -2610,6 +2612,26 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> __mt_destroy(mas_detach->tree);
> }
>
> +
> +static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> + struct ma_state *mas_detach, bool mm_wr_locked)
> +{
> + struct mmu_gather tlb;
> +
> + /*
> + * We can free page tables without write-locking mmap_lock because VMAs
> + * were isolated before we downgraded mmap_lock.
> + */
> + mas_set(mas_detach, 1);
> + lru_add_drain();
> + tlb_gather_mmu(&tlb, vms->mm);
> + update_hiwater_rss(vms->mm);
> + unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
> + mas_set(mas_detach, 1);
I know it's necessary as unmap_vmas() will adjust mas_detach, but it kind
of aesthetically sucks to set it to 1, do some stuff, then set it to 1
again. But this is not a big deal :>)
> + free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
Yeah this bit definitely needs a comment I think, this is very confusing
indeed. Under what circumstances will these differ from [vms->start,
vms->end), etc.?
I'm guessing it's to do with !vms->prev and !vms->next needing to be set to
[FIRST_USER_ADDRESS, USER_PGTABLES_CEILING)?
> + tlb_finish_mmu(&tlb);
> +}
> +
> /*
> * vms_complete_munmap_vmas() - Finish the munmap() operation
> * @vms: The vma munmap struct
> @@ -2631,13 +2653,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> if (vms->unlock)
> mmap_write_downgrade(mm);
>
> - /*
> - * We can free page tables without write-locking mmap_lock because VMAs
> - * were isolated before we downgraded mmap_lock.
> - */
> - mas_set(mas_detach, 1);
> - unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
> - vms->start, vms->end, vms->vma_count, !vms->unlock);
> + vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> /* Update high watermark before we lower total_vm */
> update_hiwater_vm(mm);
> /* Stat accounting */
> @@ -2699,6 +2715,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> goto start_split_failed;
> }
> vms->prev = vma_prev(vms->vmi);
> + if (vms->prev)
> + vms->unmap_start = vms->prev->vm_end;
>
> /*
> * Detach a range of VMAs from the mm. Using next as a temp variable as
> @@ -2757,6 +2775,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> }
>
> vms->next = vma_next(vms->vmi);
> + if (vms->next)
> + vms->unmap_end = vms->next->vm_start;
>
> #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> /* Make sure no VMAs are about to be lost. */
> --
> 2.43.0
>
Other than wanting some extra comments, this looks fine and I know how
hard-won the unmap range bit of this change was so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call
2024-07-05 20:06 ` Lorenzo Stoakes
@ 2024-07-05 20:30 ` Liam R. Howlett
2024-07-05 20:36 ` Lorenzo Stoakes
0 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-05 20:30 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 16:06]:
> On Thu, Jul 04, 2024 at 02:27:11PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Open code the do_vmi_align_munmap() call so that it can be broken up
> > later in the series.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 22 +++++++++++++++++++---
> > 1 file changed, 19 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index e9858ca8bbd4..f5b33de4e717 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2915,6 +2915,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > struct vm_area_struct *next, *prev, *merge;
> > pgoff_t pglen = len >> PAGE_SHIFT;
> > unsigned long charged = 0;
> > + struct vma_munmap_struct vms;
> > + struct ma_state mas_detach;
> > + struct maple_tree mt_detach;
> > unsigned long end = addr + len;
> > unsigned long merge_start = addr, merge_end = end;
> > bool writable_file_mapping = false;
> > @@ -2947,9 +2950,24 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > /* Find the first overlapping VMA */
> > vma = vma_find(&vmi, end);
> > if (vma) {
> > - if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> > + mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > + mt_on_stack(mt_detach);
> > + mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
>
> I'm guessing this is exactly equivalent (or equivalent enough for our purposes
> here) to the MA_STATE() call in do_vmi_align_munmap()?
Yes, what we are doing is inlining the function call so that it can be
split into parts. So for reviewing, I just made it in-line.
>
> Checking the two against each other it seems that it is indeed.
>
> > + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > + if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > + return -ENOMEM;
>
> In do_vmi_align_munmap() we also invoke validate_mm(), why aren't we doing that
> here?
I don't see the validate_mm() call in do_vmi_align_munmap, it is called
in the vms_complete_munmap_vmas() function though.
>
> > +
> > + if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > return -ENOMEM;
>
> Same here.
It would fail here, our count would be wrong.
>
> > +
> > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > + next = vms.next;
> > + prev = vms.prev;
> > + vma_prev(&vmi);
>
> I'm sure this is correct, but just to double-check - we want to set the VMI to
> prev here right?
Yes. In the current function, the vmi points to prev when we are done
this dance - so I put it there.
>
> It might be worth adding a small comment saying '/* vmi now points at prev */'
> or similar, I've found it can get quite hard to follow where the iterator is at
> sometimes.
So this is about to change..
>
> > vma = NULL;
> > + } else {
> > + next = vma_next(&vmi);
> > + prev = vma_prev(&vmi);
>
> But here we move forward to the next VMA and set this to next, then go back to
> the _original_ one and this is prev?
>
> Actually I guess if vma == NULL, next gets you to the next, and prev jumps back
> to prev, with nothing between, and so that makes sense.
Yes, this is what I'm doing.
>
>
> > }
> >
> > /*
> > @@ -2962,8 +2980,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vm_flags |= VM_ACCOUNT;
> > }
> >
> > - next = vma_next(&vmi);
> > - prev = vma_prev(&vmi);
> > if (vm_flags & VM_SPECIAL) {
> > if (prev)
> > vma_iter_next_range(&vmi);
> > --
> > 2.43.0
> >
> >
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 12/16] mm/mmap: Clean up unmap_region() argument list
2024-07-04 18:27 ` [PATCH v3 12/16] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
@ 2024-07-05 20:33 ` Lorenzo Stoakes
2024-07-10 17:14 ` Suren Baghdasaryan
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 20:33 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:14PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> With the only caller to unmap_region() being the error path of
> mmap_region(), the argument list can be significantly reduced.
>
> There is also no need to forward declare the static function any
> longer.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 25 +++++++++----------------
> 1 file changed, 9 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 45443a53be76..5d458c5f080e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -76,11 +76,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
> static bool ignore_rlimit_data;
> core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
>
> -static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
> - struct vm_area_struct *vma, struct vm_area_struct *prev,
> - struct vm_area_struct *next, unsigned long start,
> - unsigned long end, unsigned long tree_end, bool mm_wr_locked);
> -
> static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
> {
> return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
> @@ -2398,22 +2393,21 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
> *
> * Called with the mm semaphore held.
> */
> -static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
> - struct vm_area_struct *vma, struct vm_area_struct *prev,
> - struct vm_area_struct *next, unsigned long start,
> - unsigned long end, unsigned long tree_end, bool mm_wr_locked)
> +static void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> + struct vm_area_struct *prev, struct vm_area_struct *next)
> {
> + struct mm_struct *mm = vma->vm_mm;
> struct mmu_gather tlb;
> - unsigned long mt_start = mas->index;
>
> lru_add_drain();
> tlb_gather_mmu(&tlb, mm);
> update_hiwater_rss(mm);
> - unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
> - mas_set(mas, mt_start);
OK so the 'raise something in one patch only for it to be removed in the
next' pattern continues :) more aesthetically pleasing.
> + unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
> + /* mm_wr_locked = */ true);
> + mas_set(mas, vma->vm_end);
> free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> - next ? next->vm_start : USER_PGTABLES_CEILING,
> - mm_wr_locked);
> + next ? next->vm_start : USER_PGTABLES_CEILING,
> + /* mm_wr_locked = */ true);
> tlb_finish_mmu(&tlb);
> }
>
> @@ -3186,8 +3180,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
> vma_iter_set(&vmi, vma->vm_end);
> /* Undo any partial mapping done by a device driver. */
> - unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
> - vma->vm_end, vma->vm_end, true);
> + unmap_region(&vmi.mas, vma, prev, next);
> }
> if (writable_file_mapping)
> mapping_unmap_writable(file->f_mapping);
> --
> 2.43.0
>
Lovely reduction in parameters.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call
2024-07-05 20:30 ` Liam R. Howlett
@ 2024-07-05 20:36 ` Lorenzo Stoakes
2024-07-08 14:49 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-05 20:36 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Fri, Jul 05, 2024 at 04:30:46PM GMT, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 16:06]:
> > On Thu, Jul 04, 2024 at 02:27:11PM GMT, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > Open code the do_vmi_align_munmap() call so that it can be broken up
> > > later in the series.
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > ---
> > > mm/mmap.c | 22 +++++++++++++++++++---
> > > 1 file changed, 19 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index e9858ca8bbd4..f5b33de4e717 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2915,6 +2915,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > struct vm_area_struct *next, *prev, *merge;
> > > pgoff_t pglen = len >> PAGE_SHIFT;
> > > unsigned long charged = 0;
> > > + struct vma_munmap_struct vms;
> > > + struct ma_state mas_detach;
> > > + struct maple_tree mt_detach;
> > > unsigned long end = addr + len;
> > > unsigned long merge_start = addr, merge_end = end;
> > > bool writable_file_mapping = false;
> > > @@ -2947,9 +2950,24 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > /* Find the first overlapping VMA */
> > > vma = vma_find(&vmi, end);
> > > if (vma) {
> > > - if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> > > + mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > > + mt_on_stack(mt_detach);
> > > + mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> >
> > I'm guessing this is exactly equivalent (or equivalent enough for our purposes
> > here) to the MA_STATE() call in do_vmi_align_munmap()?
>
> Yes, what we are doing is inlining the function call so that it can be
> split to parts. So for reviewing, I just made it in-line.
>
> >
> > Checking the two against each other it seems that it is indeed.
> >
> > > + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > > + if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > > + return -ENOMEM;
> >
> > In do_vmi_align_munmap() we also invoke validate_mm(), why aren't we doing that
> > here?
>
> I don't see the validate_mm() call in do_vmi_align_munmap, it is called
> in the vms_complete_munmap_vmas() function though.
In do_vmi_align_munmap() you have:
error = vms_gather_munmap_vmas(&vms, &mas_detach);
if (error)
goto gather_failed;
...
gather_failed:
validate_mm(mm);
return error;
>
> >
> > > +
> > > + if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > > return -ENOMEM;
> >
> > Same here.
>
> It would fail here, our count would be wrong.
Right, but in do_vmi_align_munmap() you have:
error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
if (error)
goto clear_tree_failed;
...
clear_tree_failed:
abort_munmap_vmas(&mas_detach);
gather_failed:
validate_mm(mm);
>
> >
> > > +
> > > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > > + next = vms.next;
> > > + prev = vms.prev;
> > > + vma_prev(&vmi);
> >
> > I'm sure this is correct, but just to double-check - we want to set the VMI to
> > prev here right?
>
> Yes. In the current function, the vmi points to prev when we are done
> this dance - so I put it there.
> >
> > It might be worth adding a small comment saying '/* vmi now points at prev */'
> > or similar, I've found it can get quite hard to follow where the iterator is at
> > sometimes.
>
> So this is about to change..
Yeah, I saw :)
>
> >
> > > vma = NULL;
> > > + } else {
> > > + next = vma_next(&vmi);
> > > + prev = vma_prev(&vmi);
> >
> > But here we move forward to the next VMA and set this to next, then go back to
> > the _original_ one and this is prev?
> >
> > Actually I guess if vma == NULL, next gets you to the next, and prev jumps back
> > to prev, with nothing between, and so that makes sense.
>
> Yes, this is what I'm doing.
>
> >
> >
> > > }
> > >
> > > /*
> > > @@ -2962,8 +2980,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > vm_flags |= VM_ACCOUNT;
> > > }
> > >
> > > - next = vma_next(&vmi);
> > > - prev = vma_prev(&vmi);
> > > if (vm_flags & VM_SPECIAL) {
> > > if (prev)
> > > vma_iter_next_range(&vmi);
> > > --
> > > 2.43.0
> > >
> > >
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region()
2024-07-05 20:18 ` Lorenzo Stoakes
@ 2024-07-05 20:56 ` Liam R. Howlett
2024-07-08 11:08 ` Lorenzo Stoakes
0 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-05 20:56 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 16:18]:
> On Thu, Jul 04, 2024 at 02:27:12PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Instead of moving (or leaving) the vma iterator pointing at the previous
> > vma, leave it pointing at the insert location. Pointing the vma
> > iterator at the insert location allows for a cleaner walk of the vma
> > tree for MAP_FIXED and the no expansion cases.
>
> I mean, it's funny I literally just asked why it was being left pointing at
> prev and I guess this answers that basically :)
>
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 28 ++++++++++++++--------------
> > 1 file changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f5b33de4e717..ecf55d32e804 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2963,11 +2963,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vms_complete_munmap_vmas(&vms, &mas_detach);
> > next = vms.next;
> > prev = vms.prev;
> > - vma_prev(&vmi);
> > vma = NULL;
> > } else {
> > next = vma_next(&vmi);
> > prev = vma_prev(&vmi);
> > + if (prev)
> > + vma_iter_next_range(&vmi);
> > }
> >
> > /*
> > @@ -2980,11 +2981,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vm_flags |= VM_ACCOUNT;
> > }
> >
> > - if (vm_flags & VM_SPECIAL) {
> > - if (prev)
> > - vma_iter_next_range(&vmi);
> > + if (vm_flags & VM_SPECIAL)
> > goto cannot_expand;
> > - }
> >
> > /* Attempt to expand an old mapping */
> > /* Check next */
> > @@ -3005,19 +3003,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > merge_start = prev->vm_start;
> > vma = prev;
> > vm_pgoff = prev->vm_pgoff;
> > - } else if (prev) {
> > - vma_iter_next_range(&vmi);
> > + vma_prev(&vmi);
>
> Why not vma_iter_prev_range()?
In any situation where we arrive at prev being able to merge with the
current, the prev->end == addr and we are already pointing at the gap
with addr. So vma_iter_prev_range() and vma_prev() will do the same
thing here.
I thought a lot about this (and created a small test application since
it is so difficult to test vma merging right now..). If we do not clear
the gap, it is possible we will be pointing at the first vma in the
range, and not addr at all when we call vms_gather_munmap_vmas(), but
the loop at the end of the gather function ensures the vmi is at the
correct address (vms->start is in the range). So the result is that we
consistently point to the correct location and so these two calls will
do the same thing.
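Not the actual test application, but a minimal userspace sketch of the
general idea - map two adjacent anonymous regions with identical prot/flags
and check /proc/self/maps to see whether they ended up as a single vma:
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>
	int main(void)
	{
		size_t len = 4 * sysconf(_SC_PAGESIZE);
		/* Reserve 2 * len so the two mappings below are adjacent. */
		char *base = mmap(NULL, 2 * len, PROT_NONE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char buf[4096];
		FILE *f;
		if (base == MAP_FAILED)
			return 1;
		/* Map the two halves separately with identical prot/flags;
		 * adjacent anonymous vmas like these are merge candidates. */
		if (mmap(base, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
			return 1;
		if (mmap(base + len, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
			return 1;
		/* A merged mapping shows up as one line in /proc/self/maps
		 * covering [base, base + 2 * len). */
		f = fopen("/proc/self/maps", "r");
		if (!f)
			return 1;
		while (fgets(buf, sizeof(buf), f))
			fputs(buf, stdout);
		fclose(f);
		return 0;
	}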
>
> > }
> >
> > - /* Actually expand, if possible */
> > - if (vma &&
> > - !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> > - khugepaged_enter_vma(vma, vm_flags);
> > - goto expanded;
> > + if (vma) {
> > + /* Actually expand, if possible */
> > + if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> > + khugepaged_enter_vma(vma, vm_flags);
> > + goto expanded;
> > + }
> > +
> > + /* If the expand fails, then reposition the vma iterator */
> > + if (unlikely(vma == prev))
> > + vma_iter_set(&vmi, addr);
> > }
> >
> > - if (vma == prev)
> > - vma_iter_set(&vmi, addr);
> > cannot_expand:
> >
> > /*
> > --
> > 2.43.0
> >
> Looks good to me,
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Thanks!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region()
2024-07-05 20:56 ` Liam R. Howlett
@ 2024-07-08 11:08 ` Lorenzo Stoakes
2024-07-08 16:43 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-08 11:08 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook
On Fri, Jul 05, 2024 at 04:56:34PM GMT, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 16:18]:
> > On Thu, Jul 04, 2024 at 02:27:12PM GMT, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > Instead of moving (or leaving) the vma iterator pointing at the previous
> > > vma, leave it pointing at the insert location. Pointing the vma
> > > iterator at the insert location allows for a cleaner walk of the vma
> > > tree for MAP_FIXED and the no expansion cases.
> >
> > I mean, it's funny I literally just asked why it was being left pointing at
> > prev and I guess this answers that basically :)
> >
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > ---
> > > mm/mmap.c | 28 ++++++++++++++--------------
> > > 1 file changed, 14 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index f5b33de4e717..ecf55d32e804 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2963,11 +2963,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > vms_complete_munmap_vmas(&vms, &mas_detach);
> > > next = vms.next;
> > > prev = vms.prev;
> > > - vma_prev(&vmi);
> > > vma = NULL;
> > > } else {
> > > next = vma_next(&vmi);
> > > prev = vma_prev(&vmi);
> > > + if (prev)
> > > + vma_iter_next_range(&vmi);
> > > }
> > >
> > > /*
> > > @@ -2980,11 +2981,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > vm_flags |= VM_ACCOUNT;
> > > }
> > >
> > > - if (vm_flags & VM_SPECIAL) {
> > > - if (prev)
> > > - vma_iter_next_range(&vmi);
> > > + if (vm_flags & VM_SPECIAL)
> > > goto cannot_expand;
> > > - }
> > >
> > > /* Attempt to expand an old mapping */
> > > /* Check next */
> > > @@ -3005,19 +3003,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > merge_start = prev->vm_start;
> > > vma = prev;
> > > vm_pgoff = prev->vm_pgoff;
> > > - } else if (prev) {
> > > - vma_iter_next_range(&vmi);
> > > + vma_prev(&vmi);
> >
> > Why not vma_iter_prev_range()?
>
> In any situation where we arrive at prev being able to merge with the
> current, the prev->end == addr and we are already pointing at the gap
> with addr. So vma_iter_prev_range() and vma_prev() will do the same
> thing here.
>
> I thought a lot about this (and created a small test application since
> it is so difficult to test vma merging right now..). If we do not clear
> the gap, it is possible we will be pointing at the first vma in the
> range, and not addr at all when we call vms_gather_munmap_vmas(), but
> the loop at the end of the gather function ensures the vmi is at the
> correct address (vms->start is in the range). So the result is that we
> consistently point to the correct location and so these two calls will
> do the same thing.
Might it be worth a comment here?
>
> >
> > > }
> > >
> > > - /* Actually expand, if possible */
> > > - if (vma &&
> > > - !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> > > - khugepaged_enter_vma(vma, vm_flags);
> > > - goto expanded;
> > > + if (vma) {
> > > + /* Actually expand, if possible */
> > > + if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> > > + khugepaged_enter_vma(vma, vm_flags);
> > > + goto expanded;
> > > + }
> > > +
> > > + /* If the expand fails, then reposition the vma iterator */
> > > + if (unlikely(vma == prev))
> > > + vma_iter_set(&vmi, addr);
> > > }
> > >
> > > - if (vma == prev)
> > > - vma_iter_set(&vmi, addr);
> > > cannot_expand:
> > >
> > > /*
> > > --
> > > 2.43.0
> > >
> > Looks good to me,
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-04 18:27 ` [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-07-08 12:18 ` Lorenzo Stoakes
2024-07-08 19:10 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-08 12:18 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:15PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Instead of zeroing the vma tree and then overwriting the area, let the
> area be overwritten and then clean up the gathered vmas using
> vms_complete_munmap_vmas().
>
> In the case of a driver mapping over existing vmas, the PTEs are cleared
> using the helper vms_complete_pte_clear().
>
> Temporarily keep track of the number of pages that will be removed and
> reduce the charged amount.
>
> This also drops the validate_mm() call in the vma_expand() function.
> It is necessary to drop the validate as it would fail since the mm
> map_count would be incorrect during a vma expansion, prior to the
> cleanup from vms_complete_munmap_vmas().
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/internal.h | 1 +
> mm/mmap.c | 61 ++++++++++++++++++++++++++++++---------------------
> 2 files changed, 37 insertions(+), 25 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 4c9f06669cc4..fae4a1bba732 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
> unsigned long stack_vm;
> unsigned long data_vm;
> bool unlock; /* Unlock after the munmap */
> + bool cleared_ptes; /* If the PTE are cleared already */
> };
>
> void __meminit __init_single_page(struct page *page, unsigned long pfn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5d458c5f080e..0c334eeae8cd 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -401,17 +401,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> }
>
> static unsigned long count_vma_pages_range(struct mm_struct *mm,
> - unsigned long addr, unsigned long end)
> + unsigned long addr, unsigned long end,
> + unsigned long *nr_accounted)
> {
> VMA_ITERATOR(vmi, mm, addr);
> struct vm_area_struct *vma;
> unsigned long nr_pages = 0;
>
> + *nr_accounted = 0;
> for_each_vma_range(vmi, vma, end) {
> unsigned long vm_start = max(addr, vma->vm_start);
> unsigned long vm_end = min(end, vma->vm_end);
>
> nr_pages += PHYS_PFN(vm_end - vm_start);
> + if (vma->vm_flags & VM_ACCOUNT)
> + *nr_accounted += PHYS_PFN(vm_end - vm_start);
We're duplicating the PHYS_PFN(vm_end - vm_start) thing, probably worth
adding something like:
unsigned long num_pages = PHYS_PFN(vm_end - vm_start);
Side-note, but it'd be nice to sort out the inconsistency of PHYS_PFN()
vs. (end - start) >> PAGE_SHIFT. This is probably not a huge deal though...
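i.e. something like this (rough, untested sketch of the loop from the hunk
above):
	for_each_vma_range(vmi, vma, end) {
		unsigned long vm_start = max(addr, vma->vm_start);
		unsigned long vm_end = min(end, vma->vm_end);
		unsigned long num_pages = PHYS_PFN(vm_end - vm_start);
		nr_pages += num_pages;
		if (vma->vm_flags & VM_ACCOUNT)
			*nr_accounted += num_pages;
	}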
> }
>
> return nr_pages;
> @@ -522,6 +526,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> vms->unmap_start = FIRST_USER_ADDRESS;
> vms->unmap_end = USER_PGTABLES_CEILING;
> + vms->cleared_ptes = false;
> }
>
> /*
> @@ -730,7 +735,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> vma_iter_store(vmi, vma);
>
> vma_complete(&vp, vmi, vma->vm_mm);
> - validate_mm(vma->vm_mm);
Since we're dropping this here, do we need to re-add this somewhere
where we are confident the state will be consistent?
> return 0;
>
> nomem:
> @@ -2612,6 +2616,9 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> {
> struct mmu_gather tlb;
>
> + if (vms->cleared_ptes)
> + return;
> +
> /*
> * We can free page tables without write-locking mmap_lock because VMAs
> * were isolated before we downgraded mmap_lock.
> @@ -2624,6 +2631,7 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> mas_set(mas_detach, 1);
> free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
> tlb_finish_mmu(&tlb);
> + vms->cleared_ptes = true;
> }
>
> /*
> @@ -2936,24 +2944,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> unsigned long merge_start = addr, merge_end = end;
> bool writable_file_mapping = false;
> pgoff_t vm_pgoff;
> - int error;
> + int error = -ENOMEM;
> VMA_ITERATOR(vmi, mm, addr);
> + unsigned long nr_pages, nr_accounted;
>
> - /* Check against address space limit. */
> - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> - unsigned long nr_pages;
> -
> - /*
> - * MAP_FIXED may remove pages of mappings that intersects with
> - * requested mapping. Account for the pages it would unmap.
> - */
> - nr_pages = count_vma_pages_range(mm, addr, end);
> -
> - if (!may_expand_vm(mm, vm_flags,
> - (len >> PAGE_SHIFT) - nr_pages))
> - return -ENOMEM;
> - }
> + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
>
> + /* Check against address space limit. */
> + /*
> + * MAP_FIXED may remove pages of mappings that intersects with requested
> + * mapping. Account for the pages it would unmap.
> + */
Utter pedantry, but could these comments be combined? Bit ugly to have one
after another like this.
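e.g. something along these lines (just to illustrate, not fussed about the
exact wording):
	/*
	 * Check against address space limit.
	 *
	 * MAP_FIXED may remove pages of mappings that intersect with the
	 * requested mapping - account for the pages it would unmap.
	 */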
> + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> + return -ENOMEM;
>
> if (unlikely(!can_modify_mm(mm, addr, end)))
> return -EPERM;
> @@ -2971,14 +2974,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> if (vms_gather_munmap_vmas(&vms, &mas_detach))
> return -ENOMEM;
>
> - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> - return -ENOMEM;
> -
> - vms_complete_munmap_vmas(&vms, &mas_detach);
> next = vms.next;
> prev = vms.prev;
> vma = NULL;
> } else {
> + /* Minimal setup of vms */
> + vms.nr_pages = 0;
I'm not a huge fan of having vms be uninitialised other than this field,
and then relying on no further code change accidentally using an
uninitialised field. This is kind of asking for bugs.
Can we not find a way to sensibly initialise it somehow?
> next = vma_next(&vmi);
> prev = vma_prev(&vmi);
> if (prev)
> @@ -2990,8 +2991,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> */
> if (accountable_mapping(file, vm_flags)) {
> charged = len >> PAGE_SHIFT;
> + charged -= nr_accounted;
> if (security_vm_enough_memory_mm(mm, charged))
> - return -ENOMEM;
> + goto abort_munmap;
> + vms.nr_accounted = 0;
This is kind of expanding the 'vms possibly uninitialised apart from
selected fields' pattern, which makes me worry.
> vm_flags |= VM_ACCOUNT;
> }
>
> @@ -3040,10 +3043,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> * not unmapped, but the maps are removed from the list.
> */
> vma = vm_area_alloc(mm);
> - if (!vma) {
> - error = -ENOMEM;
> + if (!vma)
> goto unacct_error;
> - }
>
> vma_iter_config(&vmi, addr, end);
> vma_set_range(vma, addr, end, pgoff);
> @@ -3052,6 +3053,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
> if (file) {
> vma->vm_file = get_file(file);
> + /* call_mmap() map PTE, so ensure there are no existing PTEs */
Typo? Should this be 'call_mmap() maps PTEs, so ensure there are no
existing PTEs'? I feel like this could be reworded something like:
'call_mmap() may map PTEs, so clear any that may be pending unmap ahead of
time.'
> + if (vms.nr_pages)
> + vms_complete_pte_clear(&vms, &mas_detach, true);
> error = call_mmap(file, vma);
> if (error)
> goto unmap_and_free_vma;
> @@ -3142,6 +3146,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> expanded:
> perf_event_mmap(vma);
>
> + if (vms.nr_pages)
> + vms_complete_munmap_vmas(&vms, &mas_detach);
> +
Hang on, if we already did this in the if (file) branch above, might we end
up calling this twice? I didn't see vms.nr_pages get set to zero or
decremented anywhere (unless I missed it)?
> vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> if (vm_flags & VM_LOCKED) {
> if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> @@ -3189,6 +3196,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> unacct_error:
> if (charged)
> vm_unacct_memory(charged);
> +
> +abort_munmap:
> + if (vms.nr_pages)
> + abort_munmap_vmas(&mas_detach);
> validate_mm(mm);
> return error;
> }
> --
> 2.43.0
>
In general I like the approach and you've made it very clear how you've
altered this behaviour.
However I have a few concerns (as well as some trivial comments) above. With
those cleared up we'll be good to go!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 14/16] mm/mmap: Use PHYS_PFN in mmap_region()
2024-07-04 18:27 ` [PATCH v3 14/16] mm/mmap: Use PHYS_PFN " Liam R. Howlett
@ 2024-07-08 12:21 ` Lorenzo Stoakes
2024-07-09 18:35 ` Liam R. Howlett
2024-07-10 17:32 ` Suren Baghdasaryan
0 siblings, 2 replies; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-08 12:21 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:16PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Instead of shifting the length by PAGE_SIZE, use PHYS_PFN. Also use the
> existing local variable everywhere instead of some of the time.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 0c334eeae8cd..b14da6bd257f 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2935,7 +2935,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma = NULL;
> struct vm_area_struct *next, *prev, *merge;
> - pgoff_t pglen = len >> PAGE_SHIFT;
> + pgoff_t pglen = PHYS_PFN(len);
> unsigned long charged = 0;
> struct vma_munmap_struct vms;
> struct ma_state mas_detach;
> @@ -2955,7 +2955,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> * MAP_FIXED may remove pages of mappings that intersects with requested
> * mapping. Account for the pages it would unmap.
> */
> - if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> + if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> return -ENOMEM;
>
> if (unlikely(!can_modify_mm(mm, addr, end)))
> @@ -2990,7 +2990,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> * Private writable mapping: check memory availability
> */
> if (accountable_mapping(file, vm_flags)) {
> - charged = len >> PAGE_SHIFT;
> + charged = pglen;
> charged -= nr_accounted;
> if (security_vm_enough_memory_mm(mm, charged))
> goto abort_munmap;
> @@ -3149,14 +3149,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> if (vms.nr_pages)
> vms_complete_munmap_vmas(&vms, &mas_detach);
>
> - vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> + vm_stat_account(mm, vm_flags, pglen);
> if (vm_flags & VM_LOCKED) {
> if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> is_vm_hugetlb_page(vma) ||
> vma == get_gate_vma(current->mm))
> vm_flags_clear(vma, VM_LOCKED_MASK);
> else
> - mm->locked_vm += (len >> PAGE_SHIFT);
> + mm->locked_vm += pglen;
> }
>
> if (file)
> --
> 2.43.0
>
Maybe I should literally look ahead before making comments :)) thanks for
reading my mind and doing what I asked though! ;)
However I don't think you've fixed the duplication of PHYS_PFN(vm_end -
vm_start) in count_vma_pages_range() - still worth doing I think.
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 15/16] mm/mmap: Use vms accounted pages in mmap_region()
2024-07-04 18:27 ` [PATCH v3 15/16] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-07-08 12:43 ` Lorenzo Stoakes
2024-07-10 17:43 ` Suren Baghdasaryan
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-08 12:43 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:17PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Change from nr_pages variable to vms.nr_accounted for the charged pages
> calculation. This is necessary for a future patch.
>
> This also avoids checking security_vm_enough_memory_mm() if the amount
> of memory won't change.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Cc: Kees Cook <kees@kernel.org>
> ---
> mm/mmap.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b14da6bd257f..b2de26683903 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2980,6 +2980,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> } else {
> /* Minimal setup of vms */
> vms.nr_pages = 0;
> + vms.nr_accounted = 0;
This kind of highlights my concern about only setting some vms fields -
now we have to remember to change this in the right place, or happen to
know that init_vma_munmap() will be otherwise invoked.
> next = vma_next(&vmi);
> prev = vma_prev(&vmi);
> if (prev)
> @@ -2991,9 +2992,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> */
> if (accountable_mapping(file, vm_flags)) {
> charged = pglen;
> - charged -= nr_accounted;
> - if (security_vm_enough_memory_mm(mm, charged))
> + charged -= vms.nr_accounted;
> + if (charged && security_vm_enough_memory_mm(mm, charged))
> goto abort_munmap;
> +
> vms.nr_accounted = 0;
Is setting this to zero really needed here? We may be done with this, but if the
vms value represents the 'unmap state' of this range, surely the number of
accountable pages remains the same?
> vm_flags |= VM_ACCOUNT;
> }
> --
> 2.43.0
>
At this point nr_accounted is no longer used, but I'm guessing a follow up patch
will remove this? :)
I was wondering why you used that given the gather function also separately
calculates it, but I guess this answers that!
Generally this looks good to me, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-04 18:27 ` [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check " Liam R. Howlett
@ 2024-07-08 12:52 ` Lorenzo Stoakes
2024-07-08 20:43 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-08 12:52 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> call, so use it instead of looping over the vmas twice.
Predictably indeed you removed the thing I commented on in the last patch
;) but at least this time I predicted it! ;)
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 36 ++++--------------------------------
> 1 file changed, 4 insertions(+), 32 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b2de26683903..62edaabf3987 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> }
>
> -static unsigned long count_vma_pages_range(struct mm_struct *mm,
> - unsigned long addr, unsigned long end,
> - unsigned long *nr_accounted)
> -{
> - VMA_ITERATOR(vmi, mm, addr);
> - struct vm_area_struct *vma;
> - unsigned long nr_pages = 0;
> -
> - *nr_accounted = 0;
> - for_each_vma_range(vmi, vma, end) {
> - unsigned long vm_start = max(addr, vma->vm_start);
> - unsigned long vm_end = min(end, vma->vm_end);
> -
> - nr_pages += PHYS_PFN(vm_end - vm_start);
> - if (vma->vm_flags & VM_ACCOUNT)
> - *nr_accounted += PHYS_PFN(vm_end - vm_start);
> - }
> -
> - return nr_pages;
> -}
> -
> static void __vma_link_file(struct vm_area_struct *vma,
> struct address_space *mapping)
> {
> @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> pgoff_t vm_pgoff;
> int error = -ENOMEM;
> VMA_ITERATOR(vmi, mm, addr);
> - unsigned long nr_pages, nr_accounted;
> -
> - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> -
> - /* Check against address space limit. */
> - /*
> - * MAP_FIXED may remove pages of mappings that intersects with requested
> - * mapping. Account for the pages it would unmap.
> - */
> - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> - return -ENOMEM;
>
> if (unlikely(!can_modify_mm(mm, addr, end)))
> return -EPERM;
> @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vma_iter_next_range(&vmi);
> }
>
> + /* Check against address space limit. */
> + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> + goto abort_munmap;
> +
I know you can literally only do this after the vms_gather_munmap_vmas(),
but this does change where we check this, so for instance we do
arch_unmap() without having checked may_expand_vm().
However I assume this is fine?
> /*
> * Private writable mapping: check memory availability
> */
> --
> 2.43.0
>
Looks good to me generally,
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct
2024-07-05 20:27 ` Lorenzo Stoakes
@ 2024-07-08 14:45 ` Liam R. Howlett
2024-07-10 17:14 ` Suren Baghdasaryan
1 sibling, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-08 14:45 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 16:27]:
> On Thu, Jul 04, 2024 at 02:27:13PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Set the start and end address for munmap when the prev and next are
> > gathered. This is needed to avoid incorrect addresses being used during
> > the vms_complete_munmap_vmas() function if the prev/next vma are
> > expanded.
>
> When we spoke about this separately you mentioned that specific arches may
> be more likely to encounter this issue, perhaps worth mentioning something
> about that in the commit msg? Unless I misunderstood you.
What we spoke about was mappings outside vmas - that is, on certain archs
there may be mappings between two vmas. I'm not entirely sure on this or
if it's still something we have to worry about. That is why we use
prev->vm_end and next->vm_start as the unmapping range instead of the
actual vma start and end.
There is also the upper and lower limits if prev or next does not exist.
See git id 6ee8630e02be6, and e2cdef8c847b4 - probably from an older git
history than kernel.org: https://github.com/mpe/linux-fullhistory.git
What I am trying to avoid here is using the prev->vm_end address for
munmap when we are changing the prev->vm_end to expand over the area we
are mapping. And the same for expanding next backwards.
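Condensed, what this patch does (sketch only - the real code is in the
hunks quoted below) is latch the boundaries at gather time:
	/* defaults, set in init_vma_munmap() */
	vms->unmap_start = FIRST_USER_ADDRESS;
	vms->unmap_end = USER_PGTABLES_CEILING;
	/* in vms_gather_munmap_vmas(), before prev/next can be expanded */
	if (vms->prev)
		vms->unmap_start = vms->prev->vm_end;
	if (vms->next)
		vms->unmap_end = vms->next->vm_start;
and then vms_complete_pte_clear() hands vms->unmap_start/vms->unmap_end to
free_pgtables() rather than reading prev->vm_end/next->vm_start at
completion time.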
>
> >
> > Add a new helper vms_complete_pte_clear(), which is needed later and
> > will avoid growing the argument list to unmap_region() beyond the 9 it
> > already has.
>
> My word.
>
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/internal.h | 2 ++
> > mm/mmap.c | 34 +++++++++++++++++++++++++++-------
> > 2 files changed, 29 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 8cbbbe7d40f3..4c9f06669cc4 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1493,6 +1493,8 @@ struct vma_munmap_struct {
> > struct list_head *uf; /* Userfaultfd list_head */
> > unsigned long start; /* Aligned start addr */
> > unsigned long end; /* Aligned end addr */
> > + unsigned long unmap_start;
> > + unsigned long unmap_end;
> > int vma_count; /* Number of vmas that will be removed */
> > unsigned long nr_pages; /* Number of pages being removed */
> > unsigned long locked_vm; /* Number of locked pages */
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index ecf55d32e804..45443a53be76 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -525,6 +525,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > vms->vma_count = 0;
> > vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
> > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > + vms->unmap_start = FIRST_USER_ADDRESS;
> > + vms->unmap_end = USER_PGTABLES_CEILING;
> > }
> >
> > /*
> > @@ -2610,6 +2612,26 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> > __mt_destroy(mas_detach->tree);
> > }
> >
> > +
> > +static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > + struct ma_state *mas_detach, bool mm_wr_locked)
> > +{
> > + struct mmu_gather tlb;
> > +
> > + /*
> > + * We can free page tables without write-locking mmap_lock because VMAs
> > + * were isolated before we downgraded mmap_lock.
> > + */
> > + mas_set(mas_detach, 1);
> > + lru_add_drain();
> > + tlb_gather_mmu(&tlb, vms->mm);
> > + update_hiwater_rss(vms->mm);
> > + unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
> > + mas_set(mas_detach, 1);
>
> I know it's necessary as unmap_vmas() will adjust mas_detach, but it kind
> of aesthetically sucks to set it to 1, do some stuff, then set it to 1
> again. But this is not a big deal :>)
>
> > + free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
>
> Yeah this bit definitely needs a comment I think, this is very confusing
> indeed. Under what circumstances will these differ from [vms->start,
> vms->end), etc.?
>
> I'm guessing it's to do with !vms->prev and !vms->next needing to be set to
> [FIRST_USER_ADDRESS, USER_PGTABLES_CEILING)?
Yes, exactly. Since we are setting the range to unmap, we can just set
it to the correct value during the gather stage of the VMAs.
>
> > + tlb_finish_mmu(&tlb);
> > +}
> > +
> > /*
> > * vms_complete_munmap_vmas() - Finish the munmap() operation
> > * @vms: The vma munmap struct
> > @@ -2631,13 +2653,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > if (vms->unlock)
> > mmap_write_downgrade(mm);
> >
> > - /*
> > - * We can free page tables without write-locking mmap_lock because VMAs
> > - * were isolated before we downgraded mmap_lock.
> > - */
> > - mas_set(mas_detach, 1);
> > - unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
> > - vms->start, vms->end, vms->vma_count, !vms->unlock);
> > + vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> > /* Update high watermark before we lower total_vm */
> > update_hiwater_vm(mm);
> > /* Stat accounting */
> > @@ -2699,6 +2715,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > goto start_split_failed;
> > }
> > vms->prev = vma_prev(vms->vmi);
> > + if (vms->prev)
> > + vms->unmap_start = vms->prev->vm_end;
> >
> > /*
> > * Detach a range of VMAs from the mm. Using next as a temp variable as
> > @@ -2757,6 +2775,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > }
> >
> > vms->next = vma_next(vms->vmi);
> > + if (vms->next)
> > + vms->unmap_end = vms->next->vm_start;
> >
> > #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > /* Make sure no VMAs are about to be lost. */
> > --
> > 2.43.0
> >
>
> Other than wanting some extra comments, this looks fine and I know how
> hard-won the unmap range bit of this change was so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call
2024-07-05 20:36 ` Lorenzo Stoakes
@ 2024-07-08 14:49 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-08 14:49 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 16:36]:
> On Fri, Jul 05, 2024 at 04:30:46PM GMT, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 16:06]:
> > > On Thu, Jul 04, 2024 at 02:27:11PM GMT, Liam R. Howlett wrote:
> > > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > >
> > > > Open code the do_vmi_align_munmap() call so that it can be broken up
> > > > later in the series.
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > > ---
> > > > mm/mmap.c | 22 +++++++++++++++++++---
> > > > 1 file changed, 19 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index e9858ca8bbd4..f5b33de4e717 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2915,6 +2915,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > struct vm_area_struct *next, *prev, *merge;
> > > > pgoff_t pglen = len >> PAGE_SHIFT;
> > > > unsigned long charged = 0;
> > > > + struct vma_munmap_struct vms;
> > > > + struct ma_state mas_detach;
> > > > + struct maple_tree mt_detach;
> > > > unsigned long end = addr + len;
> > > > unsigned long merge_start = addr, merge_end = end;
> > > > bool writable_file_mapping = false;
> > > > @@ -2947,9 +2950,24 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > /* Find the first overlapping VMA */
> > > > vma = vma_find(&vmi, end);
> > > > if (vma) {
> > > > - if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> > > > + mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > > > + mt_on_stack(mt_detach);
> > > > + mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > >
> > > I'm guessing this is exactly equivalent (or equivalent enough for our purposes
> > > here) to the MA_STATE() call in do_vmi_align_munmap()?
> >
> > Yes, what we are doing is inlining the function call so that it can be
> > split to parts. So for reviewing, I just made it in-line.
> >
> > >
> > > Checking the two against each other it seems that it is indeed.
> > >
> > > > + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > > > + if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > > > + return -ENOMEM;
> > >
> > > In do_vmi_align_munmap() we also invoke validate_mm(), why aren't we doing that
> > > here?
> >
> > I don't see the validate_mm() call in do_vmi_align_munmap, it is called
> > in the vms_complete_munmap_vmas() function though.
>
> In do_vmi_align_munmap() you have:
>
> error = vms_gather_munmap_vmas(&vms, &mas_detach);
> if (error)
> goto gather_failed;
>
> ...
>
> gather_failed:
> validate_mm(mm);
> return error;
>
> >
> > >
> > > > +
> > > > + if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > > > return -ENOMEM;
> > >
> > > Same here.
> >
> > It would fail here, our count would be wrong.
>
> Right, but in do_vmi_align_munmap() you have:
>
> error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> if (error)
> goto clear_tree_failed;
>
> ...
>
> clear_tree_failed:
> abort_munmap_vmas(&mas_detach);
> gather_failed:
> validate_mm(mm);
>
Right, so I also missed my cleanup in the abort_munmap_vmas() here..
Thanks, I'll fix these.
>
> >
> > >
> > > > +
> > > > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > > > + next = vms.next;
> > > > + prev = vms.prev;
> > > > + vma_prev(&vmi);
> > >
> > > I'm sure this is correct, but just to double-check - we want to set the VMI to
> > > prev here right?
> >
> > Yes. In the current function, the vmi points to prev when we are done
> > this dance - so I put it there.
> > >
> > > It might be worth adding a small comment saying '/* vmi now points at prev */'
> > > or similar, I've found it can get quite hard to follow where the iterator is at
> > > sometimes.
> >
> > So this is about to change..
>
> Yeah, I saw :)
>
> >
> > >
> > > > vma = NULL;
> > > > + } else {
> > > > + next = vma_next(&vmi);
> > > > + prev = vma_prev(&vmi);
> > >
> > > But here we move forward to the next VMA and set this to next, then go back to
> > > the _original_ one and this is prev?
> > >
> > > Actually I guess if vma == NULL, next gets you to the next, and prev jumps back
> > > to prev, with nothing between, and so that makes sense.
> >
> > Yes, this is what I'm doing.
> >
> > >
> > >
> > > > }
> > > >
> > > > /*
> > > > @@ -2962,8 +2980,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > vm_flags |= VM_ACCOUNT;
> > > > }
> > > >
> > > > - next = vma_next(&vmi);
> > > > - prev = vma_prev(&vmi);
> > > > if (vm_flags & VM_SPECIAL) {
> > > > if (prev)
> > > > vma_iter_next_range(&vmi);
> > > > --
> > > > 2.43.0
> > > >
> > > >
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region()
2024-07-08 11:08 ` Lorenzo Stoakes
@ 2024-07-08 16:43 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-08 16:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 07:08]:
> On Fri, Jul 05, 2024 at 04:56:34PM GMT, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 16:18]:
> > > On Thu, Jul 04, 2024 at 02:27:12PM GMT, Liam R. Howlett wrote:
> > > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > >
> > > > Instead of moving (or leaving) the vma iterator pointing at the previous
> > > > vma, leave it pointing at the insert location. Pointing the vma
> > > > iterator at the insert location allows for a cleaner walk of the vma
> > > > tree for MAP_FIXED and the no expansion cases.
> > >
> > > I mean, it's funny I literally just asked why it was being left pointing at
> > > prev and I guess this answers that basically :)
> > >
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > > ---
> > > > mm/mmap.c | 28 ++++++++++++++--------------
> > > > 1 file changed, 14 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index f5b33de4e717..ecf55d32e804 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2963,11 +2963,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > vms_complete_munmap_vmas(&vms, &mas_detach);
> > > > next = vms.next;
> > > > prev = vms.prev;
> > > > - vma_prev(&vmi);
> > > > vma = NULL;
> > > > } else {
> > > > next = vma_next(&vmi);
> > > > prev = vma_prev(&vmi);
> > > > + if (prev)
> > > > + vma_iter_next_range(&vmi);
> > > > }
> > > >
> > > > /*
> > > > @@ -2980,11 +2981,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > vm_flags |= VM_ACCOUNT;
> > > > }
> > > >
> > > > - if (vm_flags & VM_SPECIAL) {
> > > > - if (prev)
> > > > - vma_iter_next_range(&vmi);
> > > > + if (vm_flags & VM_SPECIAL)
> > > > goto cannot_expand;
> > > > - }
> > > >
> > > > /* Attempt to expand an old mapping */
> > > > /* Check next */
> > > > @@ -3005,19 +3003,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > merge_start = prev->vm_start;
> > > > vma = prev;
> > > > vm_pgoff = prev->vm_pgoff;
> > > > - } else if (prev) {
> > > > - vma_iter_next_range(&vmi);
> > > > + vma_prev(&vmi);
> > >
> > > Why not vma_iter_prev_range()?
> >
> > In any situation where we arrive at prev being able to merge with the
> > current, the prev->end == addr and we are already pointing at the gap
> > with addr. So vma_iter_prev_range() and vma_prev() will do the same
> > thing here.
> >
> > I thought a lot about this (and created a small test application since
> > it is so difficult to test vma merging right now..). If we do not clear
> > the gap, it is possible we will be pointing at the first vma in the
> > range, and not addr at all when we call vms_gather_munmap_vmas(), but
> > the loop at the end of the gather function ensures the vmi is at the
> > correct address (vms->start is in the range). So the result is that we
> > consistently point to the correct location and so these two calls will
> > do the same thing.
>
> Might it be worth a comment here?
Okay, I'll add something.
>
> >
> > >
> > > > }
> > > >
> > > > - /* Actually expand, if possible */
> > > > - if (vma &&
> > > > - !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> > > > - khugepaged_enter_vma(vma, vm_flags);
> > > > - goto expanded;
> > > > + if (vma) {
> > > > + /* Actually expand, if possible */
> > > > + if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> > > > + khugepaged_enter_vma(vma, vm_flags);
> > > > + goto expanded;
> > > > + }
> > > > +
> > > > + /* If the expand fails, then reposition the vma iterator */
> > > > + if (unlikely(vma == prev))
> > > > + vma_iter_set(&vmi, addr);
> > > > }
> > > >
> > > > - if (vma == prev)
> > > > - vma_iter_set(&vmi, addr);
> > > > cannot_expand:
> > > >
> > > > /*
> > > > --
> > > > 2.43.0
> > > >
> > > Looks good to me,
> > >
> > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > Thanks!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-08 12:18 ` Lorenzo Stoakes
@ 2024-07-08 19:10 ` Liam R. Howlett
2024-07-09 14:27 ` Lorenzo Stoakes
0 siblings, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-08 19:10 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:18]:
> On Thu, Jul 04, 2024 at 02:27:15PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Instead of zeroing the vma tree and then overwriting the area, let the
> > area be overwritten and then clean up the gathered vmas using
> > vms_complete_munmap_vmas().
> >
> > In the case of a driver mapping over existing vmas, the PTEs are cleared
> > using the helper vms_complete_pte_clear().
> >
> > Temporarily keep track of the number of pages that will be removed and
> > reduce the charged amount.
> >
> > This also drops the validate_mm() call in the vma_expand() function.
> > It is necessary to drop the validate as it would fail since the mm
> > map_count would be incorrect during a vma expansion, prior to the
> > cleanup from vms_complete_munmap_vmas().
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/internal.h | 1 +
> > mm/mmap.c | 61 ++++++++++++++++++++++++++++++---------------------
> > 2 files changed, 37 insertions(+), 25 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 4c9f06669cc4..fae4a1bba732 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
> > unsigned long stack_vm;
> > unsigned long data_vm;
> > bool unlock; /* Unlock after the munmap */
> > + bool cleared_ptes; /* If the PTE are cleared already */
> > };
> >
> > void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 5d458c5f080e..0c334eeae8cd 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -401,17 +401,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > }
> >
> > static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > - unsigned long addr, unsigned long end)
> > + unsigned long addr, unsigned long end,
> > + unsigned long *nr_accounted)
> > {
> > VMA_ITERATOR(vmi, mm, addr);
> > struct vm_area_struct *vma;
> > unsigned long nr_pages = 0;
> >
> > + *nr_accounted = 0;
> > for_each_vma_range(vmi, vma, end) {
> > unsigned long vm_start = max(addr, vma->vm_start);
> > unsigned long vm_end = min(end, vma->vm_end);
> >
> > nr_pages += PHYS_PFN(vm_end - vm_start);
> > + if (vma->vm_flags & VM_ACCOUNT)
> > + *nr_accounted += PHYS_PFN(vm_end - vm_start);
>
> We're duplicating the PHYS_PFN(vm_end - vm_start) thing, probably worth
> adding something like:
>
> unsigned long num_pages = PHYS_PFN(vm_end - vm_start);
>
> Side-note, but it'd be nice to sort out the inconsistency of PHYS_PFN()
> vs. (end - start) >> PAGE_SHIFT. This is probably not a huge deal though...
I split this out into another patch for easier reviewing.
>
> > }
> >
> > return nr_pages;
> > @@ -522,6 +526,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > vms->unmap_start = FIRST_USER_ADDRESS;
> > vms->unmap_end = USER_PGTABLES_CEILING;
> > + vms->cleared_ptes = false;
> > }
> >
> > /*
> > @@ -730,7 +735,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > vma_iter_store(vmi, vma);
> >
> > vma_complete(&vp, vmi, vma->vm_mm);
> > - validate_mm(vma->vm_mm);
>
> Since we're dropping this here, do we need to re-add this back somehwere
> where we are confident the state will be consistent?
The vma_expand() function is used in two places - one is in the mmap.c
file which can no longer validate the mm until the munmap is complete.
The other is in fs/exec.c which cannot call the validate_mm(). So
to add this call back, I'd have to add a wrapper to vma_expand() to call
the validate_mm() function for debug builds.
Really all this code in fs/exec.c doesn't belong there so we don't need
to do an extra function wrapper just to call validate_mm(). And you have
a patch to do that which is out for review!
>
> > return 0;
> >
> > nomem:
> > @@ -2612,6 +2616,9 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > {
> > struct mmu_gather tlb;
> >
> > + if (vms->cleared_ptes)
> > + return;
> > +
> > /*
> > * We can free page tables without write-locking mmap_lock because VMAs
> > * were isolated before we downgraded mmap_lock.
> > @@ -2624,6 +2631,7 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > mas_set(mas_detach, 1);
> > free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
> > tlb_finish_mmu(&tlb);
> > + vms->cleared_ptes = true;
> > }
> >
> > /*
> > @@ -2936,24 +2944,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > unsigned long merge_start = addr, merge_end = end;
> > bool writable_file_mapping = false;
> > pgoff_t vm_pgoff;
> > - int error;
> > + int error = -ENOMEM;
> > VMA_ITERATOR(vmi, mm, addr);
> > + unsigned long nr_pages, nr_accounted;
> >
> > - /* Check against address space limit. */
> > - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> > - unsigned long nr_pages;
> > -
> > - /*
> > - * MAP_FIXED may remove pages of mappings that intersects with
> > - * requested mapping. Account for the pages it would unmap.
> > - */
> > - nr_pages = count_vma_pages_range(mm, addr, end);
> > -
> > - if (!may_expand_vm(mm, vm_flags,
> > - (len >> PAGE_SHIFT) - nr_pages))
> > - return -ENOMEM;
> > - }
> > + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> >
> > + /* Check against address space limit. */
> > + /*
> > + * MAP_FIXED may remove pages of mappings that intersects with requested
> > + * mapping. Account for the pages it would unmap.
> > + */
>
> Utter pedantry, but could these comments be combined? Bit ugly to have one
> after another like this.
Since this was mainly a relocation, I didn't want to change it too much
but since you asked, I'll do it.
>
> > + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> > + return -ENOMEM;
> >
> > if (unlikely(!can_modify_mm(mm, addr, end)))
> > return -EPERM;
> > @@ -2971,14 +2974,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > return -ENOMEM;
> >
> > - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > - return -ENOMEM;
> > -
> > - vms_complete_munmap_vmas(&vms, &mas_detach);
> > next = vms.next;
> > prev = vms.prev;
> > vma = NULL;
> > } else {
> > + /* Minimal setup of vms */
> > + vms.nr_pages = 0;
>
> I'm not a huge fan of having vms be uninitialised other than this field and
> then to rely on no further code change accidentally using an uninitialised
> field. This is kind of asking for bugs.
>
> Can we not find a way to sensibly initialise it somehow?
Yes, I can switch to the same sort of thing as the maple state and
initialize things as empty.
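As a rough sketch of that direction (the cut-down struct and VMS_MODEL_INIT
are hypothetical, not the posted code), the idea is an initializer that
always runs so no path sees an uninitialized field:

#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the munmap bookkeeping struct. */
struct vms_model {
	unsigned long nr_pages;
	unsigned long nr_accounted;
	bool cleared_ptes;
};

/* Always-used "empty" initializer, analogous to MA_STATE()-style setup, so
 * neither the MAP_FIXED path nor the no-overlap path can read garbage. */
#define VMS_MODEL_INIT \
	{ .nr_pages = 0, .nr_accounted = 0, .cleared_ptes = false }

int main(void)
{
	struct vms_model vms = VMS_MODEL_INIT;

	/* Safe to test on every path, even when no vmas were gathered. */
	return vms.nr_pages ? 1 : 0;
}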
>
> > next = vma_next(&vmi);
> > prev = vma_prev(&vmi);
> > if (prev)
> > @@ -2990,8 +2991,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > */
> > if (accountable_mapping(file, vm_flags)) {
> > charged = len >> PAGE_SHIFT;
> > + charged -= nr_accounted;
> > if (security_vm_enough_memory_mm(mm, charged))
> > - return -ENOMEM;
> > + goto abort_munmap;
> > + vms.nr_accounted = 0;
>
> This is kind of expanding the 'vms possibly unitialised apart from selected
> fields' pattern, makes me worry.
I'll fix this with an init of the struct that will always be called.
>
> > vm_flags |= VM_ACCOUNT;
> > }
> >
> > @@ -3040,10 +3043,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > * not unmapped, but the maps are removed from the list.
> > */
> > vma = vm_area_alloc(mm);
> > - if (!vma) {
> > - error = -ENOMEM;
> > + if (!vma)
> > goto unacct_error;
> > - }
> >
> > vma_iter_config(&vmi, addr, end);
> > vma_set_range(vma, addr, end, pgoff);
> > @@ -3052,6 +3053,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >
> > if (file) {
> > vma->vm_file = get_file(file);
> > + /* call_mmap() map PTE, so ensure there are no existing PTEs */
>
> Typo? Should this be 'call_mmap() maps PTEs, so ensure there are no
> existing PTEs'? I feel like this could be reworded something like:
>
> > 'call_mmap() may map PTEs, so clear any that may be pending unmap ahead of
> time.'
I had changed this already to 'call_mmap() may map PTE, so ensure there
are no existing PTEs' That way it's still one line and more descriptive
than what I had.
>
> > + if (vms.nr_pages)
> > + vms_complete_pte_clear(&vms, &mas_detach, true);
> > error = call_mmap(file, vma);
> > if (error)
> > goto unmap_and_free_vma;
> > @@ -3142,6 +3146,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > expanded:
> > perf_event_mmap(vma);
> >
> > + if (vms.nr_pages)
> > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > +
>
> Hang on, if we already did this in the if (file) branch above, might we end
> up calling this twice? I didn't see vms.nr_pages get set to zero or
> decremented anywhere (unless I missed it)?
No, we called the new helper vms_complete_pte_clear(), which uses the
added vms->cleared_ptes flag to avoid clearing the ptes again on the
second call.
Above, I modified vms_complete_pte_clear() to check vms->cleared_ptes
prior to clearing the ptes, so it will only be cleared if it needs
clearing.
I debated moving this nr_pages check within vms_complete_munmap_vmas(),
but that would add an unnecessary check to the munmap() path. Avoiding
both checks seemed too much code (yet another static inline, or such).
I also wanted to keep the sanity of nr_pages checking to a single
function - as you highlighted it could be a path to insanity.
Considering I'll switch this to a VMS_INIT(), I think that I could pass
it through and do the logic within the static inline at the expense of
the munmap() having a few extra instructions (but no cache hits, so not
a really big deal).
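A tiny standalone model of that guard, purely illustrative and not the
kernel code, showing why the second call is a no-op:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative model of the cleared_ptes guard, not the kernel code. */
struct vms_model {
	bool cleared_ptes;
};

static void complete_pte_clear(struct vms_model *vms)
{
	if (vms->cleared_ptes)	/* second call becomes a no-op */
		return;
	printf("clear PTEs and free page tables (once)\n");
	vms->cleared_ptes = true;
}

int main(void)
{
	struct vms_model vms = { .cleared_ptes = false };

	complete_pte_clear(&vms);	/* early clear before the call_mmap() step */
	complete_pte_clear(&vms);	/* completion path: skipped via the flag */
	return 0;
}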
>
> > vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > if (vm_flags & VM_LOCKED) {
> > if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > @@ -3189,6 +3196,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > unacct_error:
> > if (charged)
> > vm_unacct_memory(charged);
> > +
> > +abort_munmap:
> > + if (vms.nr_pages)
> > + abort_munmap_vmas(&mas_detach);
> > validate_mm(mm);
> > return error;
> > }
> > --
> > 2.43.0
> >
>
> In general I like the approach and you've made it very clear how you've
> altered this behaviour.
>
> However I have a few concerns (as well some trivial comments) above. With
> those cleared up we'll be good to go!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-08 12:52 ` Lorenzo Stoakes
@ 2024-07-08 20:43 ` Liam R. Howlett
2024-07-09 14:42 ` Liam R. Howlett
2024-07-09 14:45 ` Lorenzo Stoakes
0 siblings, 2 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-08 20:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > call, so use it instead of looping over the vmas twice.
>
> Predictably indeed you removed the thing I commented on in the last patch
> ;) but at least this time I predicted it! ;)
>
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 36 ++++--------------------------------
> > 1 file changed, 4 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index b2de26683903..62edaabf3987 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> > }
> >
> > -static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > - unsigned long addr, unsigned long end,
> > - unsigned long *nr_accounted)
> > -{
> > - VMA_ITERATOR(vmi, mm, addr);
> > - struct vm_area_struct *vma;
> > - unsigned long nr_pages = 0;
> > -
> > - *nr_accounted = 0;
> > - for_each_vma_range(vmi, vma, end) {
> > - unsigned long vm_start = max(addr, vma->vm_start);
> > - unsigned long vm_end = min(end, vma->vm_end);
> > -
> > - nr_pages += PHYS_PFN(vm_end - vm_start);
> > - if (vma->vm_flags & VM_ACCOUNT)
> > - *nr_accounted += PHYS_PFN(vm_end - vm_start);
> > - }
> > -
> > - return nr_pages;
> > -}
> > -
> > static void __vma_link_file(struct vm_area_struct *vma,
> > struct address_space *mapping)
> > {
> > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > pgoff_t vm_pgoff;
> > int error = -ENOMEM;
> > VMA_ITERATOR(vmi, mm, addr);
> > - unsigned long nr_pages, nr_accounted;
> > -
> > - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > -
> > - /* Check against address space limit. */
> > - /*
> > - * MAP_FIXED may remove pages of mappings that intersects with requested
> > - * mapping. Account for the pages it would unmap.
> > - */
> > - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > - return -ENOMEM;
> >
> > if (unlikely(!can_modify_mm(mm, addr, end)))
> > return -EPERM;
> > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vma_iter_next_range(&vmi);
> > }
> >
> > + /* Check against address space limit. */
> > + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > + goto abort_munmap;
> > +
>
> I know you can literally only do this after the vms_gather_munmap_vmas(),
> but this does change where we check this, so for instance we do
> arch_unmap() without having checked may_expand_vm().
>
> However I assume this is fine?
Thanks for pointing this out.
The functionality here has changed
--- from ---
may_expand_vm() check
can_modify_mm() check
arch_unmap()
vms_gather_munmap_vmas()
...
--- to ---
can_modify_mm() check
arch_unmap()
vms_gather_munmap_vmas()
may_expand_vm() check
...
vms_gather_munmap_vmas() does nothing but figures out what to do later,
but could use memory and can fail.
The user implications are:
1. The return type on the error may change to -EPERM from -ENOMEM, if
you are not allowed to expand and are trying to overwrite mseal()'ed
VMAs. That seems so very rare that I'm not sure it's worth mentioning.
2. arch_unmap() called prior to may_expand_vm().
powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
within the unmap range. User implication of this means that an
application may set the vdso to NULL prior to hitting the -ENOMEM case in
may_expand_vm() due to the address space limit.
Assuming the removal of the vdso does not cause the application to seg
fault, then the user visible change is that any vdso call after a failed
mmap(MAP_FIXED) call would result in a seg fault. The only reason it
would fail is if the mapping process was attempting to map a large
enough area over the vdso (which is accounted and in the vma tree,
afaict) and ran out of memory. Note that this situation could arise
already since we could run out of memory (not accounting) after the
arch_unmap() call within the kernel.
The code today can suffer the same fate, but not by the accounting
failure. It can happen due to failure to allocate a new vma,
do_vmi_munmap() failure after the arch_unmap() call, or any of the other
failure scenarios later in the mmap_region() function.
At the very least, this requires an expanded change log.
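For the check itself, a small standalone model of the arithmetic
(may_expand_model() is a stand-in, not the real may_expand_vm() helper):

#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-in for the address-space limit check: the new mapping only needs
 * headroom for the pages it does not already replace.
 */
static bool may_expand_model(unsigned long total_vm, unsigned long limit_pages,
			     unsigned long npages)
{
	return total_vm + npages <= limit_pages;
}

int main(void)
{
	unsigned long total_vm = 900, limit = 1000;
	unsigned long pglen = 200;	/* pages in the new MAP_FIXED mapping */
	unsigned long replaced = 150;	/* pages it overwrites (vms.nr_pages) */

	/* The raw length alone would exceed the limit ... */
	printf("raw: %d\n", may_expand_model(total_vm, limit, pglen));
	/* ... but net of the replaced pages it fits: pglen - vms.nr_pages. */
	printf("net: %d\n", may_expand_model(total_vm, limit, pglen - replaced));
	return 0;
}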
>
> > /*
> > * Private writable mapping: check memory availability
> > */
> > --
> > 2.43.0
> >
>
> Looks good to me generally,
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-08 19:10 ` Liam R. Howlett
@ 2024-07-09 14:27 ` Lorenzo Stoakes
2024-07-09 18:43 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-09 14:27 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook
On Mon, Jul 08, 2024 at 03:10:10PM GMT, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:18]:
> > On Thu, Jul 04, 2024 at 02:27:15PM GMT, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > area be overwritten and then clean up the gathered vmas using
> > > vms_complete_munmap_vmas().
> > >
> > > In the case of a driver mapping over existing vmas, the PTEs are cleared
> > > using the helper vms_complete_pte_clear().
> > >
> > > Temporarily keep track of the number of pages that will be removed and
> > > reduce the charged amount.
> > >
> > > This also drops the validate_mm() call in the vma_expand() function.
> > > It is necessary to drop the validate as it would fail since the mm
> > > map_count would be incorrect during a vma expansion, prior to the
> > > cleanup from vms_complete_munmap_vmas().
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > ---
> > > mm/internal.h | 1 +
> > > mm/mmap.c | 61 ++++++++++++++++++++++++++++++---------------------
> > > 2 files changed, 37 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 4c9f06669cc4..fae4a1bba732 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
> > > unsigned long stack_vm;
> > > unsigned long data_vm;
> > > bool unlock; /* Unlock after the munmap */
> > > + bool cleared_ptes; /* If the PTE are cleared already */
> > > };
> > >
> > > void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 5d458c5f080e..0c334eeae8cd 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -401,17 +401,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > > }
> > >
> > > static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > > - unsigned long addr, unsigned long end)
> > > + unsigned long addr, unsigned long end,
> > > + unsigned long *nr_accounted)
> > > {
> > > VMA_ITERATOR(vmi, mm, addr);
> > > struct vm_area_struct *vma;
> > > unsigned long nr_pages = 0;
> > >
> > > + *nr_accounted = 0;
> > > for_each_vma_range(vmi, vma, end) {
> > > unsigned long vm_start = max(addr, vma->vm_start);
> > > unsigned long vm_end = min(end, vma->vm_end);
> > >
> > > nr_pages += PHYS_PFN(vm_end - vm_start);
> > > + if (vma->vm_flags & VM_ACCOUNT)
> > > + *nr_accounted += PHYS_PFN(vm_end - vm_start);
> >
> > We're duplicating the PHYS_PFN(vm_end - vm_start) thing, probably worth
> > adding something like:
> >
> > unsigned long num_pages = PHYS_PFN(vm_end - vm_start);
> >
> > Side-note, but it'd be nice to sort out the inconsistency of PHYS_PFN()
> > vs. (end - start) >> PAGE_SHIFT. This is probably not a huge deal though...
>
> I split this out into another patch for easier reviewing.
Yeah I noticed, inevitably :) the PHYS_PFN(...) duplication persisted, a
small thing obviously but covered in the subsequent commit.
>
> >
> > > }
> > >
> > > return nr_pages;
> > > @@ -522,6 +526,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > > vms->unmap_start = FIRST_USER_ADDRESS;
> > > vms->unmap_end = USER_PGTABLES_CEILING;
> > > + vms->cleared_ptes = false;
> > > }
> > >
> > > /*
> > > @@ -730,7 +735,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > vma_iter_store(vmi, vma);
> > >
> > > vma_complete(&vp, vmi, vma->vm_mm);
> > > - validate_mm(vma->vm_mm);
> >
> > Since we're dropping this here, do we need to re-add this back somehwere
> > where we are confident the state will be consistent?
>
> The vma_expand() function is used in two places - one is in the mmap.c
> file which can no longer validate the mm until the munmap is complete.
> The other is in fs/exec.c which cannot call the validate_mm(). So
> to add this call back, I'd have to add a wrapper to vma_expand() to call
> the validate_mm() function for debug builds.
>
> Really all this code in fs/exec.c doesn't belong there so we don't need
> to do an extra function wrapper just to call validate_mm(). And you have
> a patch to do that which is out for review!
Indeed :) perhaps we should add it back to the wrapper?
>
> >
> > > return 0;
> > >
> > > nomem:
> > > @@ -2612,6 +2616,9 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > > {
> > > struct mmu_gather tlb;
> > >
> > > + if (vms->cleared_ptes)
> > > + return;
> > > +
> > > /*
> > > * We can free page tables without write-locking mmap_lock because VMAs
> > > * were isolated before we downgraded mmap_lock.
> > > @@ -2624,6 +2631,7 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > > mas_set(mas_detach, 1);
> > > free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
> > > tlb_finish_mmu(&tlb);
> > > + vms->cleared_ptes = true;
> > > }
> > >
> > > /*
> > > @@ -2936,24 +2944,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > unsigned long merge_start = addr, merge_end = end;
> > > bool writable_file_mapping = false;
> > > pgoff_t vm_pgoff;
> > > - int error;
> > > + int error = -ENOMEM;
> > > VMA_ITERATOR(vmi, mm, addr);
> > > + unsigned long nr_pages, nr_accounted;
> > >
> > > - /* Check against address space limit. */
> > > - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> > > - unsigned long nr_pages;
> > > -
> > > - /*
> > > - * MAP_FIXED may remove pages of mappings that intersects with
> > > - * requested mapping. Account for the pages it would unmap.
> > > - */
> > > - nr_pages = count_vma_pages_range(mm, addr, end);
> > > -
> > > - if (!may_expand_vm(mm, vm_flags,
> > > - (len >> PAGE_SHIFT) - nr_pages))
> > > - return -ENOMEM;
> > > - }
> > > + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > >
> > > + /* Check against address space limit. */
> > > + /*
> > > + * MAP_FIXED may remove pages of mappings that intersects with requested
> > > + * mapping. Account for the pages it would unmap.
> > > + */
> >
> > Utter pedantry, but could these comments be combined? Bit ugly to have one
> > after another like this.
>
> Since this was mainly a relocation, I didn't want to change it too much
> but since you asked, I'll do it.
Thanks, obviously a highly pedantic nit this one!
>
> >
> > > + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> > > + return -ENOMEM;
> > >
> > > if (unlikely(!can_modify_mm(mm, addr, end)))
> > > return -EPERM;
> > > @@ -2971,14 +2974,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > > return -ENOMEM;
> > >
> > > - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > > - return -ENOMEM;
> > > -
> > > - vms_complete_munmap_vmas(&vms, &mas_detach);
> > > next = vms.next;
> > > prev = vms.prev;
> > > vma = NULL;
> > > } else {
> > > + /* Minimal setup of vms */
> > > + vms.nr_pages = 0;
> >
> > I'm not a huge fan of having vms be uninitialised other than this field and
> > then to rely on no further code change accidentally using an uninitialised
> > field. This is kind of asking for bugs.
> >
> > Can we not find a way to sensibly initialise it somehow?
>
> Yes, I can switch to the same sort of thing as the maple state and
> initialize things as empty.
Thanks.
>
> >
> > > next = vma_next(&vmi);
> > > prev = vma_prev(&vmi);
> > > if (prev)
> > > @@ -2990,8 +2991,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > */
> > > if (accountable_mapping(file, vm_flags)) {
> > > charged = len >> PAGE_SHIFT;
> > > + charged -= nr_accounted;
> > > if (security_vm_enough_memory_mm(mm, charged))
> > > - return -ENOMEM;
> > > + goto abort_munmap;
> > > + vms.nr_accounted = 0;
> >
> > This is kind of expanding the 'vms possibly unitialised apart from selected
> > fields' pattern, makes me worry.
>
> I'll fix this with an init of the struct that will always be called.
Thanks.
>
> >
> > > vm_flags |= VM_ACCOUNT;
> > > }
> > >
> > > @@ -3040,10 +3043,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > * not unmapped, but the maps are removed from the list.
> > > */
> > > vma = vm_area_alloc(mm);
> > > - if (!vma) {
> > > - error = -ENOMEM;
> > > + if (!vma)
> > > goto unacct_error;
> > > - }
> > >
> > > vma_iter_config(&vmi, addr, end);
> > > vma_set_range(vma, addr, end, pgoff);
> > > @@ -3052,6 +3053,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >
> > > if (file) {
> > > vma->vm_file = get_file(file);
> > > + /* call_mmap() map PTE, so ensure there are no existing PTEs */
> >
> > Typo? Should this be 'call_mmap() maps PTEs, so ensure there are no
> > existing PTEs'? I feel like this could be reworded something like:
> >
> > 'call_mmap() may map PTEs, so clear any that may be pending unmap ahead of
> > time.'
>
> I had changed this already to 'call_mmap() may map PTE, so ensure there
> are no existing PTEs' That way it's still one line and more descriptive
> than what I had.
That works!
>
> >
> > > + if (vms.nr_pages)
> > > + vms_complete_pte_clear(&vms, &mas_detach, true);
> > > error = call_mmap(file, vma);
> > > if (error)
> > > goto unmap_and_free_vma;
> > > @@ -3142,6 +3146,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > expanded:
> > > perf_event_mmap(vma);
> > >
> > > + if (vms.nr_pages)
> > > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > > +
> >
> > Hang on, if we already did this in the if (file) branch above, might we end
> > up calling this twice? I didn't see vms.nr_pages get set to zero or
> > decremented anywhere (unless I missed it)?
>
> No, we called the new helper vms_complete_pte_clear(), which will avoid
> clearing the ptes by the added flag vms->cleared_ptes in the second
> call.
>
> Above, I modified vms_complete_pte_clear() to check vms->cleared_ptes
> prior to clearing the ptes, so it will only be cleared if it needs
> clearing.
>
> I debated moving this nr_pages check within vms_complete_munmap_vmas(),
> but that would add an unnecessary check to the munmap() path. Avoiding
> both checks seemed too much code (yet another static inline, or such).
> I also wanted to keep the sanity of nr_pages checking to a single
> function - as you highlighted it could be a path to insanity.
>
> Considering I'll switch this to a VMS_INIT(), I think that I could pass
> it through and do the logic within the static inline at the expense of
> the munmap() having a few extra instructions (but no cache hits, so not
> a really big deal).
Yeah it's a bit confusing that the rest of vms_complete_munmap_vmas() is
potentially run twice even if the vms_complete_pte_clear() exits early due
to vms->cleared_ptes being set.
>
> >
> > > vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > > if (vm_flags & VM_LOCKED) {
> > > if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > > @@ -3189,6 +3196,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > unacct_error:
> > > if (charged)
> > > vm_unacct_memory(charged);
> > > +
> > > +abort_munmap:
> > > + if (vms.nr_pages)
> > > + abort_munmap_vmas(&mas_detach);
> > > validate_mm(mm);
> > > return error;
> > > }
> > > --
> > > 2.43.0
> > >
> >
> > In general I like the approach and you've made it very clear how you've
> > altered this behaviour.
> >
> > However I have a few concerns (as well some trivial comments) above. With
> > those cleared up we'll be good to go!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-08 20:43 ` Liam R. Howlett
@ 2024-07-09 14:42 ` Liam R. Howlett
2024-07-09 14:51 ` Lorenzo Stoakes
2024-07-09 14:52 ` Liam R. Howlett
2024-07-09 14:45 ` Lorenzo Stoakes
1 sibling, 2 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-09 14:42 UTC (permalink / raw)
To: Lorenzo Stoakes, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook
Cc'ing Dave Hansen on this.
* Liam R. Howlett <Liam.Howlett@oracle.com> [240708 16:43]:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> > On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > > call, so use it instead of looping over the vmas twice.
> >
> > Predictably indeed you removed the thing I commented on in the last patch
> > ;) but at least this time I predicted it! ;)
> >
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > ---
> > > mm/mmap.c | 36 ++++--------------------------------
> > > 1 file changed, 4 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index b2de26683903..62edaabf3987 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
...
> > > static void __vma_link_file(struct vm_area_struct *vma,
> > > struct address_space *mapping)
> > > {
> > > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > pgoff_t vm_pgoff;
> > > int error = -ENOMEM;
> > > VMA_ITERATOR(vmi, mm, addr);
> > > - unsigned long nr_pages, nr_accounted;
> > > -
> > > - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > > -
> > > - /* Check against address space limit. */
> > > - /*
> > > - * MAP_FIXED may remove pages of mappings that intersects with requested
> > > - * mapping. Account for the pages it would unmap.
> > > - */
> > > - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > > - return -ENOMEM;
> > >
> > > if (unlikely(!can_modify_mm(mm, addr, end)))
> > > return -EPERM;
> > > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > vma_iter_next_range(&vmi);
> > > }
> > >
> > > + /* Check against address space limit. */
> > > + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > > + goto abort_munmap;
> > > +
> >
> > I know you can literally only do this after the vms_gather_munmap_vmas(),
> > but this does change where we check this, so for instance we do
> > arch_unmap() without having checked may_expand_vm().
> >
> > However I assume this is fine?
>
> Thanks for pointing this out.
>
> The functionality here has changed
> --- from ---
> may_expand_vm() check
> can_modify_mm() check
> arch_unmap()
> vms_gather_munmap_vmas()
> ...
>
> --- to ---
> can_modify_mm() check
> arch_unmap()
> vms_gather_munmap_vmas()
> may_expand_vm() check
> ...
>
> vms_gather_munmap_vmas() does nothing but figures out what to do later,
> but could use memory and can fail.
>
> The user implications are:
>
> 1. The return type on the error may change to -EPERM from -ENOMEM, if
> you are not allowed to expand and are trying to overwrite mseal()'ed
> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
>
>
> 2. arch_unmap() called prior to may_expand_vm().
> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> within the unmap range. User implication of this means that an
> application may set the vdso to NULL prior to hitting the -ENOMEM case in
> may_expand_vm() due to the address space limit.
>
> Assuming the removal of the vdso does not cause the application to seg
> fault, then the user visible change is that any vdso call after a failed
> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
> would fail is if the mapping process was attempting to map a large
> enough area over the vdso (which is accounted and in the vma tree,
> afaict) and ran out of memory. Note that this situation could arise
> already since we could run out of memory (not accounting) after the
> arch_unmap() call within the kernel.
>
> The code today can suffer the same fate, but not by the accounting
> failure. It can happen due to failure to allocate a new vma,
> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> failure scenarios later in the mmap_region() function.
>
> At the very least, this requires an expanded change log.
After doing a deep dive into the vdso issue, I think it would be best to
remove the arch_unmap() call completely in a later patch set by changing
the two areas highlighted by Dave in patch 5a28fc94c914 "x86/mpx,
mm/core: Fix recursive munmap() corruption" back in 2019 in regards to
the powerpc pointer use. But that's for later work.
In the above mentioned patch, the arch_unmap() was moved to an earlier
time to avoid removing the same vma twice from the rbtree. Since the
mpx code no longer removes the vma and powerpc never removed the vma, it
seems safe to reorder the calls as such:
can_modify_mm() check
vms_gather_munmap_vmas()
may_expand_vm() check
arch_unmap()
This seems very much fine because:
- powerpc is the only platform doing _anything_ in arch_unmap().
- powerpc used to work with the arch_unmap() call after the vma was
completely dropped.
- The vma isn't even dropped by this point and so all proposed changes
will be completely undone in the rare case of may_expand_vm() failure.
- The arch_unmap() call doesn't need to be that early anymore anyways
(mpx was dropped by Dave in 2020 git id ccaaaf6fe5a5).
I will make the order change in v4 of the patch series in its own patch.
Thanks,
Liam
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-08 20:43 ` Liam R. Howlett
2024-07-09 14:42 ` Liam R. Howlett
@ 2024-07-09 14:45 ` Lorenzo Stoakes
2024-07-10 12:28 ` Michael Ellerman
1 sibling, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-09 14:45 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook
On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> > On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > >
> > > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > > call, so use it instead of looping over the vmas twice.
> >
> > Predictably indeed you removed the thing I commented on in the last patch
> > ;) but at least this time I predicted it! ;)
> >
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > ---
> > > mm/mmap.c | 36 ++++--------------------------------
> > > 1 file changed, 4 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index b2de26683903..62edaabf3987 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > > anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> > > }
> > >
> > > -static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > > - unsigned long addr, unsigned long end,
> > > - unsigned long *nr_accounted)
> > > -{
> > > - VMA_ITERATOR(vmi, mm, addr);
> > > - struct vm_area_struct *vma;
> > > - unsigned long nr_pages = 0;
> > > -
> > > - *nr_accounted = 0;
> > > - for_each_vma_range(vmi, vma, end) {
> > > - unsigned long vm_start = max(addr, vma->vm_start);
> > > - unsigned long vm_end = min(end, vma->vm_end);
> > > -
> > > - nr_pages += PHYS_PFN(vm_end - vm_start);
> > > - if (vma->vm_flags & VM_ACCOUNT)
> > > - *nr_accounted += PHYS_PFN(vm_end - vm_start);
> > > - }
> > > -
> > > - return nr_pages;
> > > -}
> > > -
> > > static void __vma_link_file(struct vm_area_struct *vma,
> > > struct address_space *mapping)
> > > {
> > > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > pgoff_t vm_pgoff;
> > > int error = -ENOMEM;
> > > VMA_ITERATOR(vmi, mm, addr);
> > > - unsigned long nr_pages, nr_accounted;
> > > -
> > > - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > > -
> > > - /* Check against address space limit. */
> > > - /*
> > > - * MAP_FIXED may remove pages of mappings that intersects with requested
> > > - * mapping. Account for the pages it would unmap.
> > > - */
> > > - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > > - return -ENOMEM;
> > >
> > > if (unlikely(!can_modify_mm(mm, addr, end)))
> > > return -EPERM;
> > > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > vma_iter_next_range(&vmi);
> > > }
> > >
> > > + /* Check against address space limit. */
> > > + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > > + goto abort_munmap;
> > > +
> >
> > I know you can literally only do this after the vms_gather_munmap_vmas(),
> > but this does change where we check this, so for instance we do
> > arch_unmap() without having checked may_expand_vm().
> >
> > However I assume this is fine?
>
> Thanks for pointing this out.
>
> The functionality here has changed
> --- from ---
> may_expand_vm() check
> can_modify_mm() check
> arch_unmap()
> vms_gather_munmap_vmas()
> ...
>
> --- to ---
> can_modify_mm() check
> arch_unmap()
> vms_gather_munmap_vmas()
> may_expand_vm() check
> ...
>
> vms_gather_munmap_vmas() does nothing but figures out what to do later,
> but could use memory and can fail.
>
> The user implications are:
>
> 1. The return type on the error may change to -EPERM from -ENOMEM, if
> you are not allowed to expand and are trying to overwrite mseal()'ed
> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
>
>
> 2. arch_unmap() called prior to may_expand_vm().
> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> within the unmap range. User implication of this means that an
> application may set the vdso to NULL prior to hitting the -ENOMEM case in
> may_expand_vm() due to the address space limit.
>
> Assuming the removal of the vdso does not cause the application to seg
> fault, then the user visible change is that any vdso call after a failed
> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
> would fail is if the mapping process was attempting to map a large
> enough area over the vdso (which is accounted and in the vma tree,
> afaict) and ran out of memory. Note that this situation could arise
> already since we could run out of memory (not accounting) after the
> arch_unmap() call within the kernel.
>
> The code today can suffer the same fate, but not by the accounting
> failure. It can happen due to failure to allocate a new vma,
> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> failure scenarios later in the mmap_region() function.
>
> At the very least, this requires an expanded change log.
Indeed, also (as mentioned on IRC) I feel like we need to look at whether
we _truly_ need this arch_unmap() call for a single, rather antiquated,
architecture.
I mean why are they unmapping the VDSO, why is that valid, why does it need
that field to be set to NULL, is it possible to signify that in some other
way etc.?
Regardless, I think the change you make here is fine and shouldn't be a
blocker for your changes at all.
But agreed, I do think expanding the change log a bit would be helpful.
>
> >
> > > /*
> > > * Private writable mapping: check memory availability
> > > */
> > > --
> > > 2.43.0
> > >
> >
> > Looks good to me generally,
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-09 14:42 ` Liam R. Howlett
@ 2024-07-09 14:51 ` Lorenzo Stoakes
2024-07-09 14:52 ` Liam R. Howlett
1 sibling, 0 replies; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-09 14:51 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook
On Tue, Jul 09, 2024 at 10:42:41AM GMT, Liam R. Howlett wrote:
> Cc'ing Dave Hansen on this.
>
> * Liam R. Howlett <Liam.Howlett@oracle.com> [240708 16:43]:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> > > On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > >
> > > > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > > > call, so use it instead of looping over the vmas twice.
> > >
> > > Predictably indeed you removed the thing I commented on in the last patch
> > > ;) but at least this time I predicted it! ;)
> > >
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > > ---
> > > > mm/mmap.c | 36 ++++--------------------------------
> > > > 1 file changed, 4 insertions(+), 32 deletions(-)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index b2de26683903..62edaabf3987 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
>
> ...
>
> > > > static void __vma_link_file(struct vm_area_struct *vma,
> > > > struct address_space *mapping)
> > > > {
> > > > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > pgoff_t vm_pgoff;
> > > > int error = -ENOMEM;
> > > > VMA_ITERATOR(vmi, mm, addr);
> > > > - unsigned long nr_pages, nr_accounted;
> > > > -
> > > > - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > > > -
> > > > - /* Check against address space limit. */
> > > > - /*
> > > > - * MAP_FIXED may remove pages of mappings that intersects with requested
> > > > - * mapping. Account for the pages it would unmap.
> > > > - */
> > > > - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > > > - return -ENOMEM;
> > > >
> > > > if (unlikely(!can_modify_mm(mm, addr, end)))
> > > > return -EPERM;
> > > > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > vma_iter_next_range(&vmi);
> > > > }
> > > >
> > > > + /* Check against address space limit. */
> > > > + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > > > + goto abort_munmap;
> > > > +
> > >
> > > I know you can literally only do this after the vms_gather_munmap_vmas(),
> > > but this does change where we check this, so for instance we do
> > > arch_unmap() without having checked may_expand_vm().
> > >
> > > However I assume this is fine?
> >
> > Thanks for pointing this out.
> >
> > The functionality here has changed
> > --- from ---
> > may_expand_vm() check
> > can_modify_mm() check
> > arch_unmap()
> > vms_gather_munmap_vmas()
> > ...
> >
> > --- to ---
> > can_modify_mm() check
> > arch_unmap()
> > vms_gather_munmap_vmas()
> > may_expand_vm() check
> > ...
> >
> > vms_gather_munmap_vmas() does nothing but figure out what to do later,
> > though it may allocate memory and can fail.
> >
> > The user implications are:
> >
> > 1. The return type on the error may change to -EPERM from -ENOMEM, if
> > you are not allowed to expand and are trying to overwrite mseal()'ed
> > VMAs. That seems so very rare that I'm not sure it's worth mentioning.
> >
> >
> > 2. arch_unmap() called prior to may_expand_vm().
> > powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> > within the unmap range. The user implication is that an application
> > may set the vdso to NULL prior to hitting the -ENOMEM case in
> > may_expand_vm() due to the address space limit.
> >
> > Assuming the removal of the vdso does not cause the application to seg
> > fault, then the user visible change is that any vdso call after a failed
> > mmap(MAP_FIXED) call would result in a seg fault. The only reason it
> > would fail is if the mapping process was attempting to map a large
> > enough area over the vdso (which is accounted and in the vma tree,
> > afaict) and ran out of memory. Note that this situation could arise
> > already since we could run out of memory (not accounting) after the
> > arch_unmap() call within the kernel.
> >
> > The code today can suffer the same fate, but not by the accounting
> > failure. It can happen due to failure to allocate a new vma,
> > do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> > failure scenarios later in the mmap_region() function.
> >
> > At the very least, this requires an expanded change log.
>
> After doing a deep dive into the vdso issue, I think it would be best to
> remove the arch_unmap() call completely in a later patch set by changing
> the two areas highlighted by Dave in patch 5a28fc94c914 "x86/mpx,
> mm/core: Fix recursive munmap() corruption" back in 2019 in regards to
> the powerpc pointer use. But that's for later work.
Our replies coincided, but yes, absolutely. It should be a separate patch,
agreed.
>
> In the above mentioned patch, the arch_unmap() was moved to an earlier
> time to avoid removing the same vma twice from the rbtree. Since the
> mpx code no longer removes the vma and powerpc never removed the vma, it
> seems safe to reorder the calls as such:
>
> can_modify_mm() check
> vms_gather_munmap_vmas()
> may_expand_vm() check
> arch_unmap()
>
> This seems very much fine because:
> - powerpc is the only platform doing _anything_ in arch_unmap().
> - powerpc used to work with the arch_unmap() call after the vma was
> completely dropped.
> - The vma isn't even dropped by this point and so all proposed changes
> will be completely undone in the rare case of may_expand_vm() failure.
> - The arch_unmap() call doesn't need to be that early anymore anyways
> (mpx was dropped by Dave in 2020 git id ccaaaf6fe5a5).
>
> I will make the order change in v4 of the patch series in its own patch.
>
Great!
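For my own reference, the prologue should then read roughly like this (just a
sketch of the ordering, eliding the vms/mas_detach setup, not the actual v4
code):

	if (unlikely(!can_modify_mm(mm, addr, end)))
		return -EPERM;

	/* Gather (split + mark detached) anything under [addr, end) */
	if (vms_gather_munmap_vmas(&vms, &mas_detach))
		return -ENOMEM;

	/* Check against address space limit. */
	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
		goto abort_munmap;

	arch_unmap(mm, addr, end);

which at least moves arch_unmap() after all of the early failure points.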
> Thanks,
> Liam
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-09 14:42 ` Liam R. Howlett
2024-07-09 14:51 ` Lorenzo Stoakes
@ 2024-07-09 14:52 ` Liam R. Howlett
2024-07-09 18:13 ` Dave Hansen
1 sibling, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-09 14:52 UTC (permalink / raw)
To: Dave Hansen
Cc: Lorenzo Stoakes, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook
* Liam R. Howlett <Liam.Howlett@oracle.com> [240709 10:42]:
> Cc'ing Dave Hansen on this.
Really adding Dave to the discussion.
>
> * Liam R. Howlett <Liam.Howlett@oracle.com> [240708 16:43]:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:53]:
> > > On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > >
> > > > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > > > call, so use it instead of looping over the vmas twice.
> > >
> > > Predictably indeed you removed the thing I commented on in the last patch
> > > ;) but at least this time I predicted it! ;)
> > >
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > > ---
> > > > mm/mmap.c | 36 ++++--------------------------------
> > > > 1 file changed, 4 insertions(+), 32 deletions(-)
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index b2de26683903..62edaabf3987 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
>
> ...
>
> > > > static void __vma_link_file(struct vm_area_struct *vma,
> > > > struct address_space *mapping)
> > > > {
> > > > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > pgoff_t vm_pgoff;
> > > > int error = -ENOMEM;
> > > > VMA_ITERATOR(vmi, mm, addr);
> > > > - unsigned long nr_pages, nr_accounted;
> > > > -
> > > > - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > > > -
> > > > - /* Check against address space limit. */
> > > > - /*
> > > > - * MAP_FIXED may remove pages of mappings that intersects with requested
> > > > - * mapping. Account for the pages it would unmap.
> > > > - */
> > > > - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > > > - return -ENOMEM;
> > > >
> > > > if (unlikely(!can_modify_mm(mm, addr, end)))
> > > > return -EPERM;
> > > > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > vma_iter_next_range(&vmi);
> > > > }
> > > >
> > > > + /* Check against address space limit. */
> > > > + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > > > + goto abort_munmap;
> > > > +
> > >
> > > I know you can literally only do this after the vms_gather_munmap_vmas(),
> > > but this does change where we check this, so for instance we do
> > > arch_unmap() without having checked may_expand_vm().
> > >
> > > However I assume this is fine?
> >
> > Thanks for pointing this out.
> >
> > The functionality here has changed
> > --- from ---
> > may_expand_vm() check
> > can_modify_mm() check
> > arch_unmap()
> > vms_gather_munmap_vmas()
> > ...
> >
> > --- to ---
> > can_modify_mm() check
> > arch_unmap()
> > vms_gather_munmap_vmas()
> > may_expand_vm() check
> > ...
> >
> > vms_gather_munmap_vmas() does nothing but figure out what to do later,
> > though it may allocate memory and can fail.
> >
> > The user implications are:
> >
> > 1. The return type on the error may change to -EPERM from -ENOMEM, if
> > you are not allowed to expand and are trying to overwrite mseal()'ed
> > VMAs. That seems so very rare that I'm not sure it's worth mentioning.
> >
> >
> > 2. arch_unmap() called prior to may_expand_vm().
> > powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> > within the unmap range. The user implication is that an application
> > may set the vdso to NULL prior to hitting the -ENOMEM case in
> > may_expand_vm() due to the address space limit.
> >
> > Assuming the removal of the vdso does not cause the application to seg
> > fault, then the user visible change is that any vdso call after a failed
> > mmap(MAP_FIXED) call would result in a seg fault. The only reason it
> > would fail is if the mapping process was attempting to map a large
> > enough area over the vdso (which is accounted and in the vma tree,
> > afaict) and ran out of memory. Note that this situation could arise
> > already since we could run out of memory (not accounting) after the
> > arch_unmap() call within the kernel.
> >
> > The code today can suffer the same fate, but not by the accounting
> > failure. It can happen due to failure to allocate a new vma,
> > do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> > failure scenarios later in the mmap_region() function.
> >
> > At the very least, this requires an expanded change log.
>
> After doing a deep dive into the vdso issue, I think it would be best to
> remove the arch_unmap() call completely in a later patch set by changing
> the two areas highlighted by Dave in patch 5a28fc94c914 "x86/mpx,
> mm/core: Fix recursive munmap() corruption" back in 2019 in regards to
> the powerpc pointer use. But that's for later work.
>
> In the above mentioned patch, the arch_unmap() was moved to an earlier
> time to avoid removing the same vma twice from the rbtree. Since the
> mpx code no longer removes the vma and powerpc never removed the vma, it
> seems safe to reorder the calls as such:
>
> can_modify_mm() check
> vms_gather_munmap_vmas()
> may_expand_vm() check
> arch_unmap()
>
> This seems very much fine because:
> - powerpc is the only platform doing _anything_ in arch_unmap().
> - powerpc used to work with the arch_unmap() call after the vma was
> completely dropped.
> - The vma isn't even dropped by this point and so all proposed changes
> will be completely undone in the rare case of may_expand_vm() failure.
> - The arch_unmap() call doesn't need to be that early anymore anyways
> (mpx was dropped by Dave in 2020 git id ccaaaf6fe5a5).
>
> I will make the order change in v4 of the patch series in its own patch.
>
> Thanks,
> Liam
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-09 14:52 ` Liam R. Howlett
@ 2024-07-09 18:13 ` Dave Hansen
0 siblings, 0 replies; 78+ messages in thread
From: Dave Hansen @ 2024-07-09 18:13 UTC (permalink / raw)
To: Liam R. Howlett, Dave Hansen, Lorenzo Stoakes, linux-mm,
Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On 7/9/24 07:52, Liam R. Howlett wrote:
...
> - The arch_unmap() call doesn't need to be that early anymore anyways
> (mpx was dropped by Dave in 2020 git id ccaaaf6fe5a5).
Yep. MPX is long gone and x86 doesn't do anything with arch_unmap() any
more. Moving it around won't affect x86 at all.
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 14/16] mm/mmap: Use PHYS_PFN in mmap_region()
2024-07-08 12:21 ` Lorenzo Stoakes
@ 2024-07-09 18:35 ` Liam R. Howlett
2024-07-09 18:42 ` Lorenzo Stoakes
2024-07-10 17:32 ` Suren Baghdasaryan
1 sibling, 1 reply; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-09 18:35 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:21]:
> On Thu, Jul 04, 2024 at 02:27:16PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Instead of shifting the length by PAGE_SHIFT, use PHYS_PFN(). Also use the
> > existing local variable everywhere instead of only some of the time.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 0c334eeae8cd..b14da6bd257f 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2935,7 +2935,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > struct mm_struct *mm = current->mm;
> > struct vm_area_struct *vma = NULL;
> > struct vm_area_struct *next, *prev, *merge;
> > - pgoff_t pglen = len >> PAGE_SHIFT;
> > + pgoff_t pglen = PHYS_PFN(len);
> > unsigned long charged = 0;
> > struct vma_munmap_struct vms;
> > struct ma_state mas_detach;
> > @@ -2955,7 +2955,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > * MAP_FIXED may remove pages of mappings that intersects with requested
> > * mapping. Account for the pages it would unmap.
> > */
> > - if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> > + if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > return -ENOMEM;
> >
> > if (unlikely(!can_modify_mm(mm, addr, end)))
> > @@ -2990,7 +2990,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > * Private writable mapping: check memory availability
> > */
> > if (accountable_mapping(file, vm_flags)) {
> > - charged = len >> PAGE_SHIFT;
> > + charged = pglen;
> > charged -= nr_accounted;
> > if (security_vm_enough_memory_mm(mm, charged))
> > goto abort_munmap;
> > @@ -3149,14 +3149,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > if (vms.nr_pages)
> > vms_complete_munmap_vmas(&vms, &mas_detach);
> >
> > - vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > + vm_stat_account(mm, vm_flags, pglen);
> > if (vm_flags & VM_LOCKED) {
> > if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > is_vm_hugetlb_page(vma) ||
> > vma == get_gate_vma(current->mm))
> > vm_flags_clear(vma, VM_LOCKED_MASK);
> > else
> > - mm->locked_vm += (len >> PAGE_SHIFT);
> > + mm->locked_vm += pglen;
> > }
> >
> > if (file)
> > --
> > 2.43.0
> >
>
> Maybe I should literally look ahead before making comments :)) thanks for
> reading my mind and doing what I asked though! ;)
>
> However I don't think you've fixed the duplication of PHYS_PFN(vm_end -
> vm_start) in count_vma_pages_range() - still worth doing I think.
I drop that function in the last patch so probably not worth doing.
This is just a few patches before the axe drops.
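(For anyone reading along: if I remember include/linux/pfn.h correctly,
PHYS_PFN() is just the same shift spelled with a helper,

	#define PHYS_PFN(x)	((unsigned long)((x) >> PAGE_SHIFT))

so this change is purely cosmetic.)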
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 14/16] mm/mmap: Use PHYS_PFN in mmap_region()
2024-07-09 18:35 ` Liam R. Howlett
@ 2024-07-09 18:42 ` Lorenzo Stoakes
0 siblings, 0 replies; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-09 18:42 UTC (permalink / raw)
To: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook
On Tue, Jul 09, 2024 at 02:35:16PM GMT, Liam R. Howlett wrote:
[snip]
> >
> > Maybe I should literally look ahead before making comments :)) thanks for
> > reading my mind and doing what I asked though! ;)
> >
> > However I don't think you've fixed the duplication of PHYS_PFN(vm_end -
> > vm_start) in count_vma_pages_range() - still worth doing I think.
>
> I drop that function in the last patch so probably not worth doing.
> This is just a few patches before the axe drops.
>
Actually that's a fair point - I think it's fine to do without this nit given
that context!
This is the peril of reviewing forwards through the series and being
surprised later when things are addressed in subsequent patches (or become,
ultimately, irrelevant!).
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-07-09 14:27 ` Lorenzo Stoakes
@ 2024-07-09 18:43 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-09 18:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240709 10:27]:
> On Mon, Jul 08, 2024 at 03:10:10PM GMT, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240708 08:18]:
> > > On Thu, Jul 04, 2024 at 02:27:15PM GMT, Liam R. Howlett wrote:
> > > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > >
> > > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > > area be overwritten and then clean up the gathered vmas using
> > > > vms_complete_munmap_vmas().
> > > >
> > > > In the case of a driver mapping over existing vmas, the PTEs are cleared
> > > > using the helper vms_complete_pte_clear().
> > > >
> > > > Temporarily keep track of the number of pages that will be removed and
> > > > reduce the charged amount.
> > > >
> > > > This also drops the validate_mm() call in the vma_expand() function.
> > > > It is necessary to drop the validate as it would fail since the mm
> > > > map_count would be incorrect during a vma expansion, prior to the
> > > > cleanup from vms_complete_munmap_vmas().
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > > ---
> > > > mm/internal.h | 1 +
> > > > mm/mmap.c | 61 ++++++++++++++++++++++++++++++---------------------
> > > > 2 files changed, 37 insertions(+), 25 deletions(-)
> > > >
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 4c9f06669cc4..fae4a1bba732 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -1503,6 +1503,7 @@ struct vma_munmap_struct {
> > > > unsigned long stack_vm;
> > > > unsigned long data_vm;
> > > > bool unlock; /* Unlock after the munmap */
> > > > + bool cleared_ptes; /* If the PTE are cleared already */
> > > > };
> > > >
> > > > void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 5d458c5f080e..0c334eeae8cd 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
...
> > > > @@ -522,6 +526,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > > > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > > > vms->unmap_start = FIRST_USER_ADDRESS;
> > > > vms->unmap_end = USER_PGTABLES_CEILING;
> > > > + vms->cleared_ptes = false;
> > > > }
> > > >
> > > > /*
> > > > @@ -730,7 +735,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > > vma_iter_store(vmi, vma);
> > > >
> > > > vma_complete(&vp, vmi, vma->vm_mm);
> > > > - validate_mm(vma->vm_mm);
> > >
> > > Since we're dropping this here, do we need to re-add this back somehwere
> > > where we are confident the state will be consistent?
> >
> > The vma_expand() function is used in two places - one is in the mmap.c
> > file which can no longer validate the mm until the munmap is complete.
> > The other is in fs/exec.c which cannot call the validate_mm(). So
> > to add this call back, I'd have to add a wrapper to vma_expand() to call
> > the validate_mm() function for debug builds.
> >
> > Really all this code in fs/exec.c doesn't belong there so we don't need
> > to do an extra function wrapper just to call validate_mm(). And you have
> > a patch to do that which is out for review!
>
> Indeed :) perhaps we should add it back to the wrapper?
>
...
> > >
> > > > + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> > > > + return -ENOMEM;
> > > >
> > > > if (unlikely(!can_modify_mm(mm, addr, end)))
> > > > return -EPERM;
> > > > @@ -2971,14 +2974,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > > > return -ENOMEM;
> > > >
> > > > - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > > > - return -ENOMEM;
> > > > -
> > > > - vms_complete_munmap_vmas(&vms, &mas_detach);
> > > > next = vms.next;
> > > > prev = vms.prev;
> > > > vma = NULL;
> > > > } else {
> > > > + /* Minimal setup of vms */
> > > > + vms.nr_pages = 0;
> > >
...
> > > > @@ -3052,6 +3053,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > >
> > > > if (file) {
> > > > vma->vm_file = get_file(file);
> > > > + /* call_mmap() map PTE, so ensure there are no existing PTEs */
...
> > > > + if (vms.nr_pages)
> > > > + vms_complete_pte_clear(&vms, &mas_detach, true);
> > > > error = call_mmap(file, vma);
> > > > if (error)
> > > > goto unmap_and_free_vma;
> > > > @@ -3142,6 +3146,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > > > expanded:
> > > > perf_event_mmap(vma);
> > > >
> > > > + if (vms.nr_pages)
> > > > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > > > +
> > >
> > > Hang on, if we already did this in the if (file) branch above, might we end
> > > up calling this twice? I didn't see vms.nr_pages get set to zero or
> > > decremented anywhere (unless I missed it)?
> >
> > No, we called the new helper vms_complete_pte_clear(), which will avoid
> > clearing the ptes a second time thanks to the added vms->cleared_ptes
> > flag.
> >
> > Above, I modified vms_complete_pte_clear() to check vms->cleared_ptes
> > prior to clearing the ptes, so it will only be cleared if it needs
> > clearing.
> >
> > I debated moving this nr_pages check within vms_complete_munmap_vmas(),
> > but that would add an unnecessary check to the munmap() path. Avoiding
> > both checks seemed too much code (yet another static inline, or such).
> > I also wanted to keep the sanity of nr_pages checking to a single
> > function - as you highlighted it could be a path to insanity.
> >
> > Considering I'll switch this to a VMS_INIT(), I think that I could pass
> > it through and do the logic within the static inline at the expense of
> > the munmap() having a few extra instructions (but no cache hits, so not
> > a really big deal).
>
> Yeah it's a bit confusing that the rest of vms_complete_munmap_vmas() is
> potentially run twice even if the vms_complete_pte_clear() exits early due
> to vms->cleared_ptes being set.
vms_complete_munmap_vmas() is never run twice, it's only ever run once.
vms_complete_pte_clear() is called from vms_complete_munmap_vmas(), but
will do nothing if cleared_ptes == true, which is set at the end of the
pte_clear() itself, and initialized as false.
Hopefully this becomes more obvious with the change to an INIT_VMS()
paradigm. I think I'll change the name of vms_complete_pte_clear() in
an attempt to make this more obvious as well (remove the _complete,
probably).
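To spell it out, the guard is just this shape (paraphrased, parameter names
from memory - not the exact patch):

	static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
			struct ma_state *mas_detach, bool mm_wr_locked)
	{
		if (vms->cleared_ptes)	/* already done on the call_mmap() path */
			return;

		/* ... the actual unmap_region()/free_pgtables() work elided ... */

		vms->cleared_ptes = true;
	}

so when vms_complete_munmap_vmas() calls it later, it simply falls through.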
Thanks,
Liam
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-09 14:45 ` Lorenzo Stoakes
@ 2024-07-10 12:28 ` Michael Ellerman
2024-07-10 12:45 ` Lorenzo Stoakes
0 siblings, 1 reply; 78+ messages in thread
From: Michael Ellerman @ 2024-07-10 12:28 UTC (permalink / raw)
To: Lorenzo Stoakes, Liam R. Howlett, linux-mm, Andrew Morton,
Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, linuxppc-dev
Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes:
> On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
>>
...
>> The functionality here has changed
>> --- from ---
>> may_expand_vm() check
>> can_modify_mm() check
>> arch_unmap()
>> vms_gather_munmap_vmas()
>> ...
>>
>> --- to ---
>> can_modify_mm() check
>> arch_unmap()
>> vms_gather_munmap_vmas()
>> may_expand_vm() check
>> ...
>>
>> vms_gather_munmap_vmas() does nothing but figure out what to do later,
>> though it may allocate memory and can fail.
>>
>> The user implications are:
>>
>> 1. The return type on the error may change to -EPERM from -ENOMEM, if
>> you are not allowed to expand and are trying to overwrite mseal()'ed
>> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
>>
>>
>> 2. arch_unmap() called prior to may_expand_vm().
>> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
>> within the unmap range. The user implication is that an application
>> may set the vdso to NULL prior to hitting the -ENOMEM case in
>> may_expand_vm() due to the address space limit.
>>
>> Assuming the removal of the vdso does not cause the application to seg
>> fault, then the user visible change is that any vdso call after a failed
>> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
>> would fail is if the mapping process was attempting to map a large
>> enough area over the vdso (which is accounted and in the vma tree,
>> afaict) and ran out of memory. Note that this situation could arise
>> already since we could run out of memory (not accounting) after the
>> arch_unmap() call within the kernel.
>>
>> The code today can suffer the same fate, but not by the accounting
>> failure. It can happen due to failure to allocate a new vma,
>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
>> failure scenarios later in the mmap_region() function.
>>
>> At the very least, this requires an expanded change log.
>
> Indeed, also (as mentioned on IRC) I feel like we need to look at whether
> we _truly_ need this arch_unmap() call for a single, rather antiquated,
> architecture.
You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is
factually wrong :) Power10 came out of the fab just a few years ago at
7nm.
> I mean why are they unmapping the VDSO, why is that valid, why does it need
> that field to be set to NULL, is it possible to signify that in some other
> way etc.?
It was originally for CRIU. So a niche workload on a niche architecture.
But from the commit that added it, it sounds like CRIU was using mremap,
which should be handled these days by vdso_mremap(). So it could be that
arch_unmap() is not actually needed for CRIU anymore.
Then I guess we have to decide if removing our arch_unmap() would be an
ABI break, regardless of whether CRIU needs it or not.
cheers
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-10 12:28 ` Michael Ellerman
@ 2024-07-10 12:45 ` Lorenzo Stoakes
2024-07-10 12:59 ` LEROY Christophe
0 siblings, 1 reply; 78+ messages in thread
From: Lorenzo Stoakes @ 2024-07-10 12:45 UTC (permalink / raw)
To: Michael Ellerman
Cc: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook, linuxppc-dev
On Wed, Jul 10, 2024 at 10:28:01PM GMT, Michael Ellerman wrote:
> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes:
> > On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
> >>
> ...
> >> The functionality here has changed
> >> --- from ---
> >> may_expand_vm() check
> >> can_modify_mm() check
> >> arch_unmap()
> >> vms_gather_munmap_vmas()
> >> ...
> >>
> >> --- to ---
> >> can_modify_mm() check
> >> arch_unmap()
> >> vms_gather_munmap_vmas()
> >> may_expand_vm() check
> >> ...
> >>
> >> vms_gather_munmap_vmas() does nothing but figure out what to do later,
> >> though it may allocate memory and can fail.
> >>
> >> The user implications are:
> >>
> >> 1. The return type on the error may change to -EPERM from -ENOMEM, if
> >> you are not allowed to expand and are trying to overwrite mseal()'ed
> >> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
> >>
> >>
> >> 2. arch_unmap() called prior to may_expand_vm().
> >> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> >> within the unmap range. The user implication is that an application
> >> may set the vdso to NULL prior to hitting the -ENOMEM case in
> >> may_expand_vm() due to the address space limit.
> >>
> >> Assuming the removal of the vdso does not cause the application to seg
> >> fault, then the user visible change is that any vdso call after a failed
> >> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
> >> would fail is if the mapping process was attempting to map a large
> >> enough area over the vdso (which is accounted and in the vma tree,
> >> afaict) and ran out of memory. Note that this situation could arise
> >> already since we could run out of memory (not accounting) after the
> >> arch_unmap() call within the kernel.
> >>
> >> The code today can suffer the same fate, but not by the accounting
> >> failure. It can happen due to failure to allocate a new vma,
> >> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> >> failure scenarios later in the mmap_region() function.
> >>
> >> At the very least, this requires an expanded change log.
> >
> > Indeed, also (as mentioned on IRC) I feel like we need to look at whether
> > we _truly_ need this arch_unmap() call for a single, rather antiquated,
> > architecture.
>
> You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is
> factually wrong :) Power10 came out of the fab just a few years ago at
> 7nm.
Fair point ;) perhaps we could go with "rarified"? :>)
>
> > I mean why are they unmapping the VDSO, why is that valid, why does it need
> > that field to be set to NULL, is it possible to signify that in some other
> > way etc.?
>
> It was originally for CRIU. So a niche workload on a niche architecture.
>
> But from the commit that added it, it sounds like CRIU was using mremap,
> which should be handled these days by vdso_mremap(). So it could be that
> arch_unmap() is not actually needed for CRIU anymore.
Oh that's interesting!
>
> Then I guess we have to decide if removing our arch_unmap() would be an
> ABI break, regardless of whether CRIU needs it or not.
Seems to me like an internal implementation detail that hopefully shouldn't
result in any visible ABI impact?
I guess this is something we ought to assess. It would be useful to
eliminate hooks where we can so we can better control VMA behaviour without
having to worry about an arch being able to do arbitrary things at
unexpected times, especially pertinent where we change the order of things.
>
> cheers
Thanks for taking a look!
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-10 12:45 ` Lorenzo Stoakes
@ 2024-07-10 12:59 ` LEROY Christophe
2024-07-10 16:09 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: LEROY Christophe @ 2024-07-10 12:59 UTC (permalink / raw)
To: Lorenzo Stoakes, Michael Ellerman
Cc: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook, linuxppc-dev
Le 10/07/2024 à 14:45, Lorenzo Stoakes a écrit :
> On Wed, Jul 10, 2024 at 10:28:01PM GMT, Michael Ellerman wrote:
>> Lorenzo Stoakes <lorenzo.stoakes@oracle.com> writes:
>>> On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
>>>>
>> ...
>>>> The functionality here has changed
>>>> --- from ---
>>>> may_expand_vm() check
>>>> can_modify_mm() check
>>>> arch_unmap()
>>>> vms_gather_munmap_vmas()
>>>> ...
>>>>
>>>> --- to ---
>>>> can_modify_mm() check
>>>> arch_unmap()
>>>> vms_gather_munmap_vmas()
>>>> may_expand_vm() check
>>>> ...
>>>>
>>>> vms_gather_munmap_vmas() does nothing but figure out what to do later,
>>>> though it may allocate memory and can fail.
>>>>
>>>> The user implications are:
>>>>
>>>> 1. The return type on the error may change to -EPERM from -ENOMEM, if
>>>> you are not allowed to expand and are trying to overwrite mseal()'ed
>>>> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
>>>>
>>>>
>>>> 2. arch_unmap() called prior to may_expand_vm().
>>>> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
>>>> within the unmap range. The user implication is that an application
>>>> may set the vdso to NULL prior to hitting the -ENOMEM case in
>>>> may_expand_vm() due to the address space limit.
>>>>
>>>> Assuming the removal of the vdso does not cause the application to seg
>>>> fault, then the user visible change is that any vdso call after a failed
>>>> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
>>>> would fail is if the mapping process was attempting to map a large
>>>> enough area over the vdso (which is accounted and in the vma tree,
>>>> afaict) and ran out of memory. Note that this situation could arise
>>>> already since we could run out of memory (not accounting) after the
>>>> arch_unmap() call within the kernel.
>>>>
>>>> The code today can suffer the same fate, but not by the accounting
>>>> failure. It can happen due to failure to allocate a new vma,
>>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
>>>> failure scenarios later in the mmap_region() function.
>>>>
>>>> At the very least, this requires an expanded change log.
>>>
>>> Indeed, also (as mentioned on IRC) I feel like we need to look at whether
>>> we _truly_ need this arch_unmap() call for a single, rather antiquated,
>>> architecture.
>>
>> You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is
>> factually wrong :) Power10 came out of the fab just a few years ago at
>> 7nm.
>
> Fair point ;) perhaps we could go with "rarified"? :>)
>
>>
>>> I mean why are they unmapping the VDSO, why is that valid, why does it need
>>> that field to be set to NULL, is it possible to signify that in some other
>>> way etc.?
>>
>> It was originally for CRIU. So a niche workload on a niche architecture.
>>
>> But from the commit that added it, it sounds like CRIU was using mremap,
>> which should be handled these days by vdso_mremap(). So it could be that
>> arch_unmap() is not actually needed for CRIU anymore.
>
> Oh that's interesting!
>
>>
>> Then I guess we have to decide if removing our arch_unmap() would be an
>> ABI break, regardless of whether CRIU needs it or not.
>
> Seems to me like an internal implementation detail that hopefully shouldn't
> result in any visible ABI impact?
>
> I guess this is something we ought to assess. It would be useful to
> eliminate hooks where we can so we can better control VMA behaviour without
> having to worry about an arch being able to do arbitrary things at
> unexpected times, especially pertinent where we change the order of things.
>
I see you are talking about arch_unmap(). I didn't follow the entire
discussion but we have some related stuff here:
https://github.com/linuxppc/issues/issues/241
If I remember correctly arch_unmap() should have gone away with Dmitry's
series
https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/#r
but it hasn't been applied yet.
Christophe
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 07/16] mm/mmap: Extract validate_mm() from vma_complete()
2024-07-05 19:35 ` Lorenzo Stoakes
@ 2024-07-10 16:06 ` Suren Baghdasaryan
0 siblings, 0 replies; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 16:06 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, linux-mm, Andrew Morton, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Fri, Jul 5, 2024 at 12:35 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jul 04, 2024 at 02:27:09PM GMT, Liam R. Howlett wrote:
> > vma_complete() will need to be called during an unsafe time to call
> > validate_mm(). Extract the call in all places now so that only one
> > location can be modified in the next change.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > ---
> > mm/mmap.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 2a1a49f98fa3..8d9be791997a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -646,7 +646,6 @@ static inline void vma_complete(struct vma_prepare *vp,
> > }
> > if (vp->insert && vp->file)
> > uprobe_mmap(vp->insert);
> > - validate_mm(mm);
> > }
> >
> > /*
> > @@ -734,6 +733,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > vma_iter_store(vmi, vma);
> >
> > vma_complete(&vp, vmi, vma->vm_mm);
> > + validate_mm(vma->vm_mm);
> > return 0;
> >
> > nomem:
> > @@ -775,6 +775,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > vma_iter_clear(vmi);
> > vma_set_range(vma, start, end, pgoff);
> > vma_complete(&vp, vmi, vma->vm_mm);
> > + validate_mm(vma->vm_mm);
> > return 0;
> > }
> >
> > @@ -1103,6 +1104,7 @@ static struct vm_area_struct
> > }
> >
> > vma_complete(&vp, vmi, mm);
> > + validate_mm(mm);
> > khugepaged_enter_vma(res, vm_flags);
> > return res;
> >
> > @@ -2481,6 +2483,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >
> > /* vma_complete stores the new vma */
> > vma_complete(&vp, vmi, vma->vm_mm);
> > + validate_mm(vma->vm_mm);
> >
> > /* Success. */
> > if (new_below)
> > @@ -3353,6 +3356,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > vma_iter_store(vmi, vma);
> >
> > vma_complete(&vp, vmi, mm);
> > + validate_mm(mm);
> > khugepaged_enter_vma(vma, flags);
> > goto out;
> > }
> > --
> > 2.43.0
> >
> >
>
> LGTM
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas()
2024-07-05 18:12 ` Liam R. Howlett
@ 2024-07-10 16:06 ` Suren Baghdasaryan
0 siblings, 0 replies; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 16:06 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, linux-mm, Andrew Morton,
Suren Baghdasaryan, Vlastimil Babka, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook
On Fri, Jul 5, 2024 at 11:12 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 13:02]:
> > On Thu, Jul 04, 2024 at 02:27:04PM GMT, Liam R. Howlett wrote:
> > > Extract the cleanup of failed munmap() operations from
> > > do_vmi_align_munmap(). This simplifies later patches in the series.
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > mm/mmap.c | 25 ++++++++++++++++++++-----
> > > 1 file changed, 20 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 28a46d9ddde0..d572e1ff8255 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2586,6 +2586,25 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
> > > vma->vm_userfaultfd_ctx, anon_vma_name(vma));
> > > }
> > >
> > > +/*
> > > + * abort_munmap_vmas - Undo any munmap work and free resources
> > > + *
> > > + * Reattach detached vmas, free up maple tree used to track the vmas.
> > > + */
> > > +static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> > > +{
> > > + struct vm_area_struct *vma;
> > > + int limit;
> > > +
> > > + limit = mas_detach->index;
> >
> > This feels like a change to existing behaviour actually, I mean a sensible
> > one - as you are not just walking the tree start-to-end but rather only
> > walking up to the point that it has been populated (assuming I'm not
> > missing anything, looks to me like mas_for_each is _inclusive_ on max).
>
> This is not the main tree, but the detached tree. It only contains the
> vmas that are going to be freed (or, rather aborted from being freed).
>
> I see what you mean that the end in the abort code below would be one
> beyond the tree walk. The new abort code uses the index (from the
> previous write) as the limit.
>
> All that really matters is that we go to a number high enough to cover
> all vmas that were detached. I used 'end' in the below code because I
> knew it would cover all of the vmas added (we actually start at index
> 0).
>
> The value of 'mas_detach->index' is used in the new code because I knew
> that's as far as I had to go, and I could limit the arguments passed
> to the function.
>
> I think that I'll actually change limit to ULONG_MAX in another revision
> because I like that better than expecting the index to have not been
> touched by others.
>
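(i.e. presumably the abort walk just becomes:

	mas_set(mas_detach, 0);
	/* Re-attach any detached VMAs */
	mas_for_each(mas_detach, vma, ULONG_MAX)
		vma_mark_detached(vma, false);

	__mt_destroy(mas_detach->tree);

with no limit bookkeeping at all, which does read simpler.)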
> >
> > Maybe worth mentioning in commit msg?
>
> Yes, good idea. Thanks for catching this.
>
> >
> > > + mas_set(mas_detach, 0);
> > > + /* Re-attach any detached VMAs */
> > > + mas_for_each(mas_detach, vma, limit)
> > > + vma_mark_detached(vma, false);
> > > +
> > > + __mt_destroy(mas_detach->tree);
> > > +}
> > > +
> > > /*
> > > * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> > > * @vmi: The vma iterator
> > > @@ -2740,11 +2759,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > userfaultfd_error:
> > > munmap_gather_failed:
> > > end_split_failed:
> > > - mas_set(&mas_detach, 0);
> > > - mas_for_each(&mas_detach, next, end)
> > > - vma_mark_detached(next, false);
> > > -
> > > - __mt_destroy(&mt_detach);
> > > + abort_munmap_vmas(&mas_detach);
> > > start_split_failed:
> > > map_count_exceeded:
> > > validate_mm(mm);
> > > --
> > > 2.43.0
> > >
> >
> > This looks fine though, feel free to add:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks.
>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas()
2024-07-04 18:27 ` [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
2024-07-05 17:39 ` Lorenzo Stoakes
@ 2024-07-10 16:07 ` Suren Baghdasaryan
1 sibling, 0 replies; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 16:07 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 4, 2024 at 11:27 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> Extract all necessary operations that need to be completed after the vma
> maple tree is updated from a munmap() operation. Extracting this makes
> the later patch in the series easier to understand.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/mmap.c | 79 +++++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 53 insertions(+), 26 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d572e1ff8255..411798f46932 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2605,6 +2605,56 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> __mt_destroy(mas_detach->tree);
> }
>
> +/*
> + * vmi_complete_munmap_vmas() - Finish the munmap() operation
> + * @vmi: The vma iterator
You are missing vma and mm documentation. With that fixed
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
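(i.e. just something along the lines of:

	 * @vma: The first vm_area_struct to unmap
	 * @mm: The mm_struct

to match the rest of the kernel-doc.)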
> + * @start: The start address
> + * @end: The end address
> + * @unlock: Unlock the mm or not
> + * @mas_detach: them maple state of the detached vma maple tree
> + * @locked_vm: The locked_vm count in the detached vmas
> + *
> + * This function updates the mm_struct, unmaps the region, frees the resources
> + * used for the munmap() and may downgrade the lock - if requested. Everything
> + * needed to be done once the vma maple tree is updated.
> + */
> +static void
> +vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> + struct mm_struct *mm, unsigned long start,
> + unsigned long end, bool unlock, struct ma_state *mas_detach,
> + unsigned long locked_vm)
> +{
> + struct vm_area_struct *prev, *next;
> + int count;
> +
> + count = mas_detach->index + 1;
> + mm->map_count -= count;
> + mm->locked_vm -= locked_vm;
> + if (unlock)
> + mmap_write_downgrade(mm);
> +
> + prev = vma_iter_prev_range(vmi);
> + next = vma_next(vmi);
> + if (next)
> + vma_iter_prev_range(vmi);
> +
> + /*
> + * We can free page tables without write-locking mmap_lock because VMAs
> + * were isolated before we downgraded mmap_lock.
> + */
> + mas_set(mas_detach, 1);
> + unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
> + !unlock);
> + /* Statistics and freeing VMAs */
> + mas_set(mas_detach, 0);
> + remove_mt(mm, mas_detach);
> + validate_mm(mm);
> + if (unlock)
> + mmap_read_unlock(mm);
> +
> + __mt_destroy(mas_detach->tree);
> +}
> +
> /*
> * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> * @vmi: The vma iterator
> @@ -2624,7 +2674,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct mm_struct *mm, unsigned long start,
> unsigned long end, struct list_head *uf, bool unlock)
> {
> - struct vm_area_struct *prev, *next = NULL;
> + struct vm_area_struct *next = NULL;
> struct maple_tree mt_detach;
> int count = 0;
> int error = -ENOMEM;
> @@ -2728,31 +2778,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> goto clear_tree_failed;
>
> /* Point of no return */
> - mm->locked_vm -= locked_vm;
> - mm->map_count -= count;
> - if (unlock)
> - mmap_write_downgrade(mm);
> -
> - prev = vma_iter_prev_range(vmi);
> - next = vma_next(vmi);
> - if (next)
> - vma_iter_prev_range(vmi);
> -
> - /*
> - * We can free page tables without write-locking mmap_lock because VMAs
> - * were isolated before we downgraded mmap_lock.
> - */
> - mas_set(&mas_detach, 1);
> - unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
> - !unlock);
> - /* Statistics and freeing VMAs */
> - mas_set(&mas_detach, 0);
> - remove_mt(mm, &mas_detach);
> - validate_mm(mm);
> - if (unlock)
> - mmap_read_unlock(mm);
> -
> - __mt_destroy(&mt_detach);
> + vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
> + locked_vm);
> return 0;
>
> clear_tree_failed:
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
2024-07-04 18:27 ` [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
2024-07-05 18:01 ` Lorenzo Stoakes
@ 2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 16:32 ` Liam R. Howlett
1 sibling, 1 reply; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 16:07 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 4, 2024 at 11:27 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
> detached maple tree for removal later. Part of the gathering is the
> splitting of vmas that span the boundary.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> ---
> mm/mmap.c | 82 +++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 58 insertions(+), 24 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 411798f46932..8dc8ffbf9d8d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2656,32 +2656,29 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> }
>
> /*
> - * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> + * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> + * for removal at a later date. Handles splitting first and last if necessary
> + * and marking the vmas as isolated.
> + *
> * @vmi: The vma iterator
> * @vma: The starting vm_area_struct
> * @mm: The mm_struct
> * @start: The aligned start address to munmap.
> * @end: The aligned end address to munmap.
> * @uf: The userfaultfd list_head
> - * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
> - * success.
> + * @mas_detach: The maple state tracking the detached tree
> *
> - * Return: 0 on success and drops the lock if so directed, error and leaves the
> - * lock held otherwise.
> + * Return: 0 on success
> */
> static int
> -do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> +vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct mm_struct *mm, unsigned long start,
> - unsigned long end, struct list_head *uf, bool unlock)
> + unsigned long end, struct list_head *uf,
> + struct ma_state *mas_detach, unsigned long *locked_vm)
> {
> struct vm_area_struct *next = NULL;
> - struct maple_tree mt_detach;
> int count = 0;
> int error = -ENOMEM;
> - unsigned long locked_vm = 0;
> - MA_STATE(mas_detach, &mt_detach, 0, 0);
> - mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> - mt_on_stack(mt_detach);
>
> /*
> * If we need to split any vma, do it now to save pain later.
> @@ -2720,15 +2717,14 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> goto end_split_failed;
> }
> vma_start_write(next);
> - mas_set(&mas_detach, count);
> - error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
> + mas_set(mas_detach, count++);
> + if (next->vm_flags & VM_LOCKED)
> + *locked_vm += vma_pages(next);
Uh, this was confusing. You moved locked_vm/count accounting before
mas_store_gfp(), so if mas_store_gfp() fails, they will be off by one
because we incremented them but failed to insert the element. Only
later I realized that if mas_store_gfp() fails then we never use these
counters. The code is still correct but I'm wondering if this movement
was necessary. We wouldn't use wrong values but why make them wrong in
the first place?
In later patches you account for more things in here and all that is
also done before mas_store_gfp(). Would moving all that after
mas_store_gfp() and keeping them always correct be an issue?
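I.e. something along these lines (just a sketch of the ordering):

	mas_set(mas_detach, count);
	error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
	if (error)
		goto munmap_gather_failed;

	vma_mark_detached(next, true);
	count++;
	if (next->vm_flags & VM_LOCKED)
		*locked_vm += vma_pages(next);

so the counters are only touched once the store has succeeded.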
> +
> + error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> if (error)
> goto munmap_gather_failed;
> vma_mark_detached(next, true);
> - if (next->vm_flags & VM_LOCKED)
> - locked_vm += vma_pages(next);
> -
> - count++;
> if (unlikely(uf)) {
> /*
> * If userfaultfd_unmap_prep returns an error the vmas
> @@ -2753,7 +2749,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> /* Make sure no VMAs are about to be lost. */
> {
> - MA_STATE(test, &mt_detach, 0, 0);
> + MA_STATE(test, mas_detach->tree, 0, 0);
> struct vm_area_struct *vma_mas, *vma_test;
> int test_count = 0;
>
> @@ -2773,6 +2769,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> while (vma_iter_addr(vmi) > start)
> vma_iter_prev_range(vmi);
>
> + return 0;
> +
> +userfaultfd_error:
> +munmap_gather_failed:
> +end_split_failed:
> + abort_munmap_vmas(mas_detach);
> +start_split_failed:
> +map_count_exceeded:
> + return error;
> +}
> +
> +/*
> + * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> + * @vmi: The vma iterator
> + * @vma: The starting vm_area_struct
> + * @mm: The mm_struct
> + * @start: The aligned start address to munmap.
> + * @end: The aligned end address to munmap.
> + * @uf: The userfaultfd list_head
> + * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
> + * success.
> + *
> + * Return: 0 on success and drops the lock if so directed, error and leaves the
> + * lock held otherwise.
> + */
> +static int
> +do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> + struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct list_head *uf, bool unlock)
> +{
> + struct maple_tree mt_detach;
> + MA_STATE(mas_detach, &mt_detach, 0, 0);
> + mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> + mt_on_stack(mt_detach);
> + int error;
> + unsigned long locked_vm = 0;
> +
> + error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
> + &mas_detach, &locked_vm);
> + if (error)
> + goto gather_failed;
> +
> error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> if (error)
> goto clear_tree_failed;
> @@ -2783,12 +2821,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> return 0;
>
> clear_tree_failed:
> -userfaultfd_error:
> -munmap_gather_failed:
> -end_split_failed:
> abort_munmap_vmas(&mas_detach);
> -start_split_failed:
> -map_count_exceeded:
> +gather_failed:
> validate_mm(mm);
> return error;
> }
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations
2024-07-05 19:09 ` Liam R. Howlett
@ 2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 16:30 ` Liam R. Howlett
0 siblings, 1 reply; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 16:07 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, linux-mm, Andrew Morton,
Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Fri, Jul 5, 2024 at 12:09 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 14:39]:
> > On Thu, Jul 04, 2024 at 02:27:07PM GMT, Liam R. Howlett wrote:
> > > Use a structure to pass along all the necessary information and counters
> > > involved in removing vmas from the mm_struct.
> > >
> > > Update vmi_ function names to vms_ to indicate the first argument
> > > type change.
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > > mm/internal.h | 16 ++++++
> > > mm/mmap.c | 137 ++++++++++++++++++++++++++------------------------
> > > 2 files changed, 88 insertions(+), 65 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 2ea9a88dcb95..f1e6dea2efcf 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -1481,6 +1481,22 @@ struct vma_prepare {
> > > struct vm_area_struct *remove2;
> > > };
> > >
> > > +/*
> > > + * vma munmap operation
> > > + */
> > > +struct vma_munmap_struct {
> > > + struct vma_iterator *vmi;
> > > + struct mm_struct *mm;
> > > + struct vm_area_struct *vma; /* The first vma to munmap */
> > > + struct list_head *uf; /* Userfaultfd list_head */
> > > + unsigned long start; /* Aligned start addr */
> > > + unsigned long end; /* Aligned end addr */
> > > + int vma_count; /* Number of vmas that will be removed */
> > > + unsigned long nr_pages; /* Number of pages being removed */
> > > + unsigned long locked_vm; /* Number of locked pages */
> > > + bool unlock; /* Unlock after the munmap */
> > > +};
> >
> >
> > I'm a big fan of breaking out and threading state like this through some of
> > these more... verbose VMA functions.
> >
> > I have a similar idea as part of my long dreamed of 'delete vma_merge()'
> > patch set. Coming soon :)
> >
> > > +
> > > void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > > unsigned long zone, int nid);
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index 8dc8ffbf9d8d..76e93146ee9d 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -500,6 +500,31 @@ static inline void init_vma_prep(struct vma_prepare *vp,
> > > init_multi_vma_prep(vp, vma, NULL, NULL, NULL);
> > > }
> > >
> > > +/*
> > > + * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> > > + * @vms: The vma munmap struct
> > > + * @vmi: The vma iterator
> > > + * @vma: The first vm_area_struct to munmap
> > > + * @start: The aligned start address to munmap
> > > + * @end: The aligned end address to munmap
> >
> > Maybe worth mentioning if inclusive/exclusive.
>
> The "address to munmap" isn't specific enough that we are using the same
> logic as the munmap call? That is, the vma inclusive and exclusive for
> start and end, respectively.
>
> Not a big change, either way.
+1. Every time I look into these functions with start/end I have to go
back and check these inclusive/exclusive rules, so mentioning it would
be helpful.
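Even just:

	 * @start: The aligned start address to munmap (inclusive)
	 * @end: The aligned end address to munmap (exclusive)

would save that round trip.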
>
> ...
>
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks
>
^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-07-05 19:59 ` Liam R. Howlett
@ 2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 17:29 ` Suren Baghdasaryan
0 siblings, 1 reply; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 16:07 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, linux-mm, Andrew Morton,
Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Fri, Jul 5, 2024 at 12:59 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 15:27]:
> > On Thu, Jul 04, 2024 at 02:27:08PM GMT, Liam R. Howlett wrote:
> > > Clean up the code by changing the munmap operation to use a structure
> > > for the accounting and munmap variables.
> > >
> > > Since remove_mt() is only called in one location and the contents will
> > > be reduce to almost nothing. The remains of the function can be added
> > > to vms_complete_munmap_vmas().
> > >
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > ---
> > > mm/internal.h | 6 ++++
> > > mm/mmap.c | 81 ++++++++++++++++++++++++++-------------------------
> > > 2 files changed, 47 insertions(+), 40 deletions(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index f1e6dea2efcf..8cbbbe7d40f3 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -1488,12 +1488,18 @@ struct vma_munmap_struct {
> > > struct vma_iterator *vmi;
> > > struct mm_struct *mm;
> > > struct vm_area_struct *vma; /* The first vma to munmap */
> > > + struct vm_area_struct *next; /* vma after the munmap area */
> > > + struct vm_area_struct *prev; /* vma before the munmap area */
> >
> > I mean this is about as pedantic as it gets, and, admittedly an annoying
> > comment to make, but the ordering... can't we at least put prev before
> > next? ;)
>
> I can do that, no problem.
> ...
>
> >
> > This is a big improvement overall, very fiddly code.
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> Thanks!
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-10 12:59 ` LEROY Christophe
@ 2024-07-10 16:09 ` Liam R. Howlett
2024-07-10 19:27 ` Dmitry Safonov
2024-07-10 21:04 ` LEROY Christophe
0 siblings, 2 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-10 16:09 UTC (permalink / raw)
To: LEROY Christophe
Cc: Lorenzo Stoakes, Michael Ellerman, linux-mm, Andrew Morton,
Suren Baghdasaryan, Vlastimil Babka, Matthew Wilcox,
sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
linux-kernel, Kees Cook, linuxppc-dev, Dmitry Safonov
* LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 08:59]:
>
...
> >>>>
> >>>> Assuming the removal of the vdso does not cause the application to seg
> >>>> fault, then the user visible change is that any vdso call after a failed
> >>>> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
> >>>> would fail is if the mapping process was attempting to map a large
> >>>> enough area over the vdso (which is accounted and in the vma tree,
> >>>> afaict) and ran out of memory. Note that this situation could arise
> >>>> already since we could run out of memory (not accounting) after the
> >>>> arch_unmap() call within the kernel.
> >>>>
> >>>> The code today can suffer the same fate, but not by the accounting
> >>>> failure. It can happen due to failure to allocate a new vma,
> >>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> >>>> failure scenarios later in the mmap_region() function.
> >>>>
> >>>> At the very least, this requires an expanded change log.
> >>>
...
> >>> I mean why are they unmapping the VDSO, why is that valid, why does it need
> >>> that field to be set to NULL, is it possible to signify that in some other
> >>> way etc.?
> >>
> >> It was originally for CRIU. So a niche workload on a niche architecture.
> >>
> >> But from the commit that added it, it sounds like CRIU was using mremap,
> >> which should be handled these days by vdso_mremap(). So it could be that
> >> arch_unmap() is not actually needed for CRIU anymore.
> >
> > Oh that's interesting!
> >
> >>
> >> Then I guess we have to decide if removing our arch_unmap() would be an
> >> ABI break, regardless of whether CRIU needs it or not.
> >
> > Seems to me like an internal implementation detail that should hopefully
> > not result in anything that should have visible ABI impact?
> >
> > I guess this is something we ought to assess. It would be useful to
> > eliminate hooks where we can so we can better control VMA behaviour without
> > having to worry about an arch being able to do arbitrary things at
> > unexpected times, especially pertinent where we change the order of things.
> >
>
> I see you are talking about arch_unmap(). I didn't follow the entire
> discussion but we have some related stuff here:
> https://github.com/linuxppc/issues/issues/241
>
> If I remember correctly arch_unmap() should have gone away with Dmitry's
> series
> https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/#r
> but it hasn't been applied yet.
>
That is good news!
To review, ppc is the only arch using this now and it sounds like you
want to remove it too.
Considering the age of that thread and the possibility of conflict with
my series, can I drop the entire arch_unmap() function instead of
modifying the handling in core mm? I'm going to assume that's okay and
start working on this for v4 (because there hasn't been a public reply
for v4 since 2023/10/11).
This would mean less arch-specific hooks and that's always a good idea.
Thanks,
Liam
* Re: [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region()
2024-07-04 18:27 ` [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
2024-07-05 19:39 ` Lorenzo Stoakes
@ 2024-07-10 16:15 ` Suren Baghdasaryan
2024-07-10 16:35 ` Liam R. Howlett
1 sibling, 1 reply; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 16:15 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 4, 2024 at 11:27 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> mmap_region is already passed sanitized addr and len, so change the
> call to do_vmi_munmap() to do_vmi_align_munmap() and inline the other
> checks.
Hmm. I think such refactoring when you want to skip some checks would
be done a bit differently... You would introduce a __do_vmi_munmap()
function which is called at the end of do_vmi_munmap() after the
checks and then call __do_vmi_munmap() directly wherever you want to
skip the checks. That would avoid code duplication. Any reason that
can't be done here?
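Roughly the shape I have in mind (just a sketch; the __do_vmi_munmap() name
and the exact argument split are made up here, not existing code):

static int __do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
			   unsigned long start, unsigned long end,
			   struct list_head *uf, bool unlock)
{
	struct vm_area_struct *vma;

	if (unlikely(!can_modify_mm(mm, start, end)))
		return -EPERM;

	/* arch_unmap() might do unmaps itself. */
	arch_unmap(mm, start, end);

	/* Find the first overlapping VMA */
	vma = vma_find(vmi, end);
	if (!vma)
		return 0;

	return do_vmi_align_munmap(vmi, vma, mm, start, end, uf, unlock);
}

int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
		  unsigned long start, size_t len, struct list_head *uf,
		  bool unlock)
{
	unsigned long end = start + PAGE_ALIGN(len);

	/* Argument sanity checks stay here (abridged)... */
	if (offset_in_page(start) || end == start)
		return -EINVAL;

	/* ...and callers with sanitized addr/len use __do_vmi_munmap(). */
	return __do_vmi_munmap(vmi, mm, start, end, uf, unlock);
}

Then mmap_region() could call __do_vmi_munmap() directly instead of
open-coding the can_modify_mm()/arch_unmap()/vma_find() sequence.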
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8d9be791997a..e9858ca8bbd4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2937,12 +2937,20 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> return -ENOMEM;
> }
>
> - /* Unmap any existing mapping in the area */
> - error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
> - if (error == -EPERM)
> - return error;
> - else if (error)
> - return -ENOMEM;
> +
> + if (unlikely(!can_modify_mm(mm, addr, end)))
> + return -EPERM;
> +
> + /* arch_unmap() might do unmaps itself. */
> + arch_unmap(mm, addr, end);
> +
> + /* Find the first overlapping VMA */
> + vma = vma_find(&vmi, end);
> + if (vma) {
> + if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> + return -ENOMEM;
> + vma = NULL;
> + }
>
> /*
> * Private writable mapping: check memory availability
> --
> 2.43.0
>
* Re: [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations
2024-07-10 16:07 ` Suren Baghdasaryan
@ 2024-07-10 16:30 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-10 16:30 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Lorenzo Stoakes, linux-mm, Andrew Morton, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
* Suren Baghdasaryan <surenb@google.com> [240710 12:07]:
> On Fri, Jul 5, 2024 at 12:09 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 14:39]:
> > > On Thu, Jul 04, 2024 at 02:27:07PM GMT, Liam R. Howlett wrote:
> > > > Use a structure to pass along all the necessary information and counters
> > > > involved in removing vmas from the mm_struct.
> > > >
> > > > Update vmi_ function names to vms_ to indicate the first argument
> > > > type change.
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > > mm/internal.h | 16 ++++++
> > > > mm/mmap.c | 137 ++++++++++++++++++++++++++------------------------
> > > > 2 files changed, 88 insertions(+), 65 deletions(-)
> > > >
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 2ea9a88dcb95..f1e6dea2efcf 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -1481,6 +1481,22 @@ struct vma_prepare {
> > > > struct vm_area_struct *remove2;
> > > > };
> > > >
> > > > +/*
> > > > + * vma munmap operation
> > > > + */
> > > > +struct vma_munmap_struct {
> > > > + struct vma_iterator *vmi;
> > > > + struct mm_struct *mm;
> > > > + struct vm_area_struct *vma; /* The first vma to munmap */
> > > > + struct list_head *uf; /* Userfaultfd list_head */
> > > > + unsigned long start; /* Aligned start addr */
> > > > + unsigned long end; /* Aligned end addr */
> > > > + int vma_count; /* Number of vmas that will be removed */
> > > > + unsigned long nr_pages; /* Number of pages being removed */
> > > > + unsigned long locked_vm; /* Number of locked pages */
> > > > + bool unlock; /* Unlock after the munmap */
> > > > +};
> > >
> > >
> > > I'm a big fan of breaking out and threading state like this through some of
> > > these more... verbose VMA functions.
> > >
> > > I have a similar idea as part of my long dreamed of 'delete vma_merge()'
> > > patch set. Coming soon :)
> > >
> > > > +
> > > > void __meminit __init_single_page(struct page *page, unsigned long pfn,
> > > > unsigned long zone, int nid);
> > > >
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index 8dc8ffbf9d8d..76e93146ee9d 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -500,6 +500,31 @@ static inline void init_vma_prep(struct vma_prepare *vp,
> > > > init_multi_vma_prep(vp, vma, NULL, NULL, NULL);
> > > > }
> > > >
> > > > +/*
> > > > + * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> > > > + * @vms: The vma munmap struct
> > > > + * @vmi: The vma iterator
> > > > + * @vma: The first vm_area_struct to munmap
> > > > + * @start: The aligned start address to munmap
> > > > + * @end: The aligned end address to munmap
> > >
> > > Maybe worth mentioning if inclusive/exclusive.
> >
> > The "address to munmap" isn't specific enough that we are using the same
> > logic as the munmap call? That is, the vma inclusive and exclusive for
> > start and end, respectively.
> >
> > Not a big change, either way.
>
> +1. Every time I look into these functions with start/end I have to go
> back and check these inclusive/exclusive rules, so mentioning it would
> be helpful.
I am making this clear with the following in v4:
+ unsigned long start; /* Aligned start addr (inclusive) */
+ unsigned long end; /* Aligned end addr (exclusive) */
Any time we deal with vmas it is like this; the maple tree itself is
inclusive/inclusive.
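As a quick illustration of the two conventions (toy snippet only, assuming an
already initialized ma_state 'mas'; not part of the patch):

	/* munmap-style range: start inclusive, end exclusive */
	unsigned long start = 0x1000, end = 0x3000;

	/* the same span as a maple tree range: both ends inclusive */
	mas_set_range(&mas, start, end - 1);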
* Re: [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
2024-07-10 16:07 ` Suren Baghdasaryan
@ 2024-07-10 16:32 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-10 16:32 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
* Suren Baghdasaryan <surenb@google.com> [240710 12:07]:
> On Thu, Jul 4, 2024 at 11:27 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
> > detached maple tree for removal later. Part of the gathering is the
> > splitting of vmas that span the boundary.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > ---
> > mm/mmap.c | 82 +++++++++++++++++++++++++++++++++++++++----------------
> > 1 file changed, 58 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 411798f46932..8dc8ffbf9d8d 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2656,32 +2656,29 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > }
> >
> > /*
> > - * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> > + * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
> > + * for removal at a later date. Handles splitting first and last if necessary
> > + * and marking the vmas as isolated.
> > + *
> > * @vmi: The vma iterator
> > * @vma: The starting vm_area_struct
> > * @mm: The mm_struct
> > * @start: The aligned start address to munmap.
> > * @end: The aligned end address to munmap.
> > * @uf: The userfaultfd list_head
> > - * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
> > - * success.
> > + * @mas_detach: The maple state tracking the detached tree
> > *
> > - * Return: 0 on success and drops the lock if so directed, error and leaves the
> > - * lock held otherwise.
> > + * Return: 0 on success
> > */
> > static int
> > -do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > +vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > struct mm_struct *mm, unsigned long start,
> > - unsigned long end, struct list_head *uf, bool unlock)
> > + unsigned long end, struct list_head *uf,
> > + struct ma_state *mas_detach, unsigned long *locked_vm)
> > {
> > struct vm_area_struct *next = NULL;
> > - struct maple_tree mt_detach;
> > int count = 0;
> > int error = -ENOMEM;
> > - unsigned long locked_vm = 0;
> > - MA_STATE(mas_detach, &mt_detach, 0, 0);
> > - mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > - mt_on_stack(mt_detach);
> >
> > /*
> > * If we need to split any vma, do it now to save pain later.
> > @@ -2720,15 +2717,14 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > goto end_split_failed;
> > }
> > vma_start_write(next);
> > - mas_set(&mas_detach, count);
> > - error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
> > + mas_set(mas_detach, count++);
> > + if (next->vm_flags & VM_LOCKED)
> > + *locked_vm += vma_pages(next);
>
> Uh, this was confusing. You moved locked_vm/count accounting before
> mas_store_gfp(), so if mas_store_gfp() fails, they will be off by one
> because we incremented them but failed to insert the element. Only
> later did I realize that if mas_store_gfp() fails then we never use these
> counters. The code is still correct but I'm wondering if this movement
> was necessary. We wouldn't use wrong values but why make them wrong in
> the first place?
> In later patches you account for more things in here and all that is
> also done before mas_store_gfp(). Would moving all that after
> mas_store_gfp() and keeping them always correct be an issue?
The accounting is only ever used in the event of a successful munmap()
operation, but I can make this change. As the author I didn't see this,
so thanks for pointing it out.
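For v4 that would look roughly like this (sketch against the hunk quoted
above, not the final code):

	vma_start_write(next);
	mas_set(mas_detach, count);
	error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
	if (error)
		goto munmap_gather_failed;

	vma_mark_detached(next, true);
	/* Only bump the counters once the store has succeeded */
	count++;
	if (next->vm_flags & VM_LOCKED)
		*locked_vm += vma_pages(next);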
>
>
>
>
> > +
> > + error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
> > if (error)
> > goto munmap_gather_failed;
> > vma_mark_detached(next, true);
> > - if (next->vm_flags & VM_LOCKED)
> > - locked_vm += vma_pages(next);
> > -
> > - count++;
> > if (unlikely(uf)) {
> > /*
> > * If userfaultfd_unmap_prep returns an error the vmas
> > @@ -2753,7 +2749,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > /* Make sure no VMAs are about to be lost. */
> > {
> > - MA_STATE(test, &mt_detach, 0, 0);
> > + MA_STATE(test, mas_detach->tree, 0, 0);
> > struct vm_area_struct *vma_mas, *vma_test;
> > int test_count = 0;
> >
> > @@ -2773,6 +2769,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > while (vma_iter_addr(vmi) > start)
> > vma_iter_prev_range(vmi);
> >
> > + return 0;
> > +
> > +userfaultfd_error:
> > +munmap_gather_failed:
> > +end_split_failed:
> > + abort_munmap_vmas(mas_detach);
> > +start_split_failed:
> > +map_count_exceeded:
> > + return error;
> > +}
> > +
> > +/*
> > + * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
> > + * @vmi: The vma iterator
> > + * @vma: The starting vm_area_struct
> > + * @mm: The mm_struct
> > + * @start: The aligned start address to munmap.
> > + * @end: The aligned end address to munmap.
> > + * @uf: The userfaultfd list_head
> > + * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
> > + * success.
> > + *
> > + * Return: 0 on success and drops the lock if so directed, error and leaves the
> > + * lock held otherwise.
> > + */
> > +static int
> > +do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > + struct mm_struct *mm, unsigned long start,
> > + unsigned long end, struct list_head *uf, bool unlock)
> > +{
> > + struct maple_tree mt_detach;
> > + MA_STATE(mas_detach, &mt_detach, 0, 0);
> > + mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > + mt_on_stack(mt_detach);
> > + int error;
> > + unsigned long locked_vm = 0;
> > +
> > + error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
> > + &mas_detach, &locked_vm);
> > + if (error)
> > + goto gather_failed;
> > +
> > error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
> > if (error)
> > goto clear_tree_failed;
> > @@ -2783,12 +2821,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > return 0;
> >
> > clear_tree_failed:
> > -userfaultfd_error:
> > -munmap_gather_failed:
> > -end_split_failed:
> > abort_munmap_vmas(&mas_detach);
> > -start_split_failed:
> > -map_count_exceeded:
> > +gather_failed:
> > validate_mm(mm);
> > return error;
> > }
> > --
> > 2.43.0
> >
* Re: [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region()
2024-07-10 16:15 ` Suren Baghdasaryan
@ 2024-07-10 16:35 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-10 16:35 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
* Suren Baghdasaryan <surenb@google.com> [240710 12:16]:
> On Thu, Jul 4, 2024 at 11:27 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > mmap_region is already passed sanitized addr and len, so change the
> > call to do_vmi_munmap() to do_vmi_align_munmap() and inline the other
> > checks.
>
> Hmm. I think such refactoring when you want to skip some checks would
> be done a bit differently... You would introduce a __do_vmi_munmap()
> function which is called at the end of do_vmi_munmap() after the
> checks and then call __do_vmi_munmap() directly wherever you want to
> skip the checks. That would avoid code duplication. Any reason that
> can't be done here?
Because I'm going to completely change and remove the checks in future
patches. I should have noted in the change log that this is an
intermediate step to ease review.
>
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 20 ++++++++++++++------
> > 1 file changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 8d9be791997a..e9858ca8bbd4 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2937,12 +2937,20 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > return -ENOMEM;
> > }
> >
> > - /* Unmap any existing mapping in the area */
> > - error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
> > - if (error == -EPERM)
> > - return error;
> > - else if (error)
> > - return -ENOMEM;
> > +
> > + if (unlikely(!can_modify_mm(mm, addr, end)))
> > + return -EPERM;
> > +
> > + /* arch_unmap() might do unmaps itself. */
> > + arch_unmap(mm, addr, end);
> > +
> > + /* Find the first overlapping VMA */
> > + vma = vma_find(&vmi, end);
> > + if (vma) {
> > + if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
> > + return -ENOMEM;
> > + vma = NULL;
> > + }
> >
> > /*
> > * Private writable mapping: check memory availability
> > --
> > 2.43.0
> >
>
* Re: [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region()
2024-07-04 18:27 ` [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
2024-07-05 20:18 ` Lorenzo Stoakes
@ 2024-07-10 16:48 ` Suren Baghdasaryan
2024-07-10 17:18 ` Liam R. Howlett
1 sibling, 1 reply; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 16:48 UTC (permalink / raw)
To: Liam R. Howlett
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Thu, Jul 4, 2024 at 11:27 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Instead of moving (or leaving) the vma iterator pointing at the previous
> vma, leave it pointing at the insert location. Pointing the vma
> iterator at the insert location allows for a cleaner walk of the vma
> tree for MAP_FIXED and the no expansion cases.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ---
> mm/mmap.c | 28 ++++++++++++++--------------
> 1 file changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f5b33de4e717..ecf55d32e804 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2963,11 +2963,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vms_complete_munmap_vmas(&vms, &mas_detach);
> next = vms.next;
> prev = vms.prev;
> - vma_prev(&vmi);
> vma = NULL;
> } else {
> next = vma_next(&vmi);
> prev = vma_prev(&vmi);
> + if (prev)
> + vma_iter_next_range(&vmi);
> }
>
> /*
> @@ -2980,11 +2981,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vm_flags |= VM_ACCOUNT;
> }
>
> - if (vm_flags & VM_SPECIAL) {
> - if (prev)
> - vma_iter_next_range(&vmi);
> + if (vm_flags & VM_SPECIAL)
> goto cannot_expand;
> - }
>
> /* Attempt to expand an old mapping */
> /* Check next */
> @@ -3005,19 +3003,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> merge_start = prev->vm_start;
> vma = prev;
> vm_pgoff = prev->vm_pgoff;
> - } else if (prev) {
> - vma_iter_next_range(&vmi);
> + vma_prev(&vmi);
> }
>
> - /* Actually expand, if possible */
> - if (vma &&
> - !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> - khugepaged_enter_vma(vma, vm_flags);
> - goto expanded;
> + if (vma) {
> + /* Actually expand, if possible */
> + if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> + khugepaged_enter_vma(vma, vm_flags);
> + goto expanded;
> + }
> +
> + /* If the expand fails, then reposition the vma iterator */
> + if (unlikely(vma == prev))
> + vma_iter_set(&vmi, addr);
> }
>
> - if (vma == prev)
> - vma_iter_set(&vmi, addr);
Before this change we would reposition vmi if vma == prev == NULL.
After this change we don't do that. Is this situation possible and if
so, will vmi be correct?
> cannot_expand:
>
> /*
> --
> 2.43.0
>
* Re: [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct
2024-07-05 20:27 ` Lorenzo Stoakes
2024-07-08 14:45 ` Liam R. Howlett
@ 2024-07-10 17:14 ` Suren Baghdasaryan
1 sibling, 0 replies; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 17:14 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, linux-mm, Andrew Morton, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Fri, Jul 5, 2024 at 1:27 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jul 04, 2024 at 02:27:13PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Set the start and end address for munmap when the prev and next are
> > gathered. This is needed to avoid incorrect addresses being used during
> > the vms_complete_munmap_vmas() function if the prev/next vma are
> > expanded.
>
> When we spoke about this separately you mentioned that specific arches may
> be more likely to encounter this issue, perhaps worth mentioning something
> about that in the commit msg? Unless I misunderstood you.
>
> >
> > Add a new helper vms_complete_pte_clear(), which is needed later and
> > will avoid growing the argument list to unmap_region() beyond the 9 it
> > already has.
>
> My word.
>
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/internal.h | 2 ++
> > mm/mmap.c | 34 +++++++++++++++++++++++++++-------
> > 2 files changed, 29 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 8cbbbe7d40f3..4c9f06669cc4 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1493,6 +1493,8 @@ struct vma_munmap_struct {
> > struct list_head *uf; /* Userfaultfd list_head */
> > unsigned long start; /* Aligned start addr */
> > unsigned long end; /* Aligned end addr */
> > + unsigned long unmap_start;
> > + unsigned long unmap_end;
> > int vma_count; /* Number of vmas that will be removed */
> > unsigned long nr_pages; /* Number of pages being removed */
> > unsigned long locked_vm; /* Number of locked pages */
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index ecf55d32e804..45443a53be76 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -525,6 +525,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > vms->vma_count = 0;
> > vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
> > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > + vms->unmap_start = FIRST_USER_ADDRESS;
> > + vms->unmap_end = USER_PGTABLES_CEILING;
> > }
> >
> > /*
> > @@ -2610,6 +2612,26 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> > __mt_destroy(mas_detach->tree);
> > }
> >
> > +
> > +static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > + struct ma_state *mas_detach, bool mm_wr_locked)
> > +{
> > + struct mmu_gather tlb;
> > +
> > + /*
> > + * We can free page tables without write-locking mmap_lock because VMAs
> > + * were isolated before we downgraded mmap_lock.
> > + */
> > + mas_set(mas_detach, 1);
> > + lru_add_drain();
> > + tlb_gather_mmu(&tlb, vms->mm);
> > + update_hiwater_rss(vms->mm);
> > + unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
> > + mas_set(mas_detach, 1);
>
> I know it's necessary as unmap_vmas() will adjust mas_detach, but it kind
> of aesthetically sucks to set it to 1, do some stuff, then set it to 1
> again. But this is not a big deal :>)
>
> > + free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
>
> Yeah this bit definitely needs a comment I think, this is very confusing
> indeed. Under what circumstances will these differ from [vms->start,
> vms->end), etc.?
>
> I'm guessing it's to do with !vms->prev and !vms->next needing to be set to
> [FIRST_USER_ADDRESS, USER_PGTABLES_CEILING)?
>
> > + tlb_finish_mmu(&tlb);
> > +}
> > +
> > /*
> > * vms_complete_munmap_vmas() - Finish the munmap() operation
> > * @vms: The vma munmap struct
> > @@ -2631,13 +2653,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > if (vms->unlock)
> > mmap_write_downgrade(mm);
> >
> > - /*
> > - * We can free page tables without write-locking mmap_lock because VMAs
> > - * were isolated before we downgraded mmap_lock.
> > - */
> > - mas_set(mas_detach, 1);
> > - unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
> > - vms->start, vms->end, vms->vma_count, !vms->unlock);
> > + vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> > /* Update high watermark before we lower total_vm */
> > update_hiwater_vm(mm);
> > /* Stat accounting */
> > @@ -2699,6 +2715,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > goto start_split_failed;
> > }
> > vms->prev = vma_prev(vms->vmi);
> > + if (vms->prev)
> > + vms->unmap_start = vms->prev->vm_end;
> >
> > /*
> > * Detach a range of VMAs from the mm. Using next as a temp variable as
> > @@ -2757,6 +2775,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > }
> >
> > vms->next = vma_next(vms->vmi);
> > + if (vms->next)
> > + vms->unmap_end = vms->next->vm_start;
> >
> > #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > /* Make sure no VMAs are about to be lost. */
> > --
> > 2.43.0
> >
>
> Other than wanting some extra comments, this looks fine and I know how
> hard-won the unmap range bit of this change was so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Ok, another case when code duplication will be removed in the next patch. LGTM.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
* Re: [PATCH v3 12/16] mm/mmap: Clean up unmap_region() argument list
2024-07-05 20:33 ` Lorenzo Stoakes
@ 2024-07-10 17:14 ` Suren Baghdasaryan
0 siblings, 0 replies; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 17:14 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, linux-mm, Andrew Morton, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Fri, Jul 5, 2024 at 1:33 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jul 04, 2024 at 02:27:14PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > With the only caller to unmap_region() being the error path of
> > mmap_region(), the argument list can be significantly reduced.
> >
> > There is also no need to forward declare the static function any
> > longer.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 25 +++++++++----------------
> > 1 file changed, 9 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 45443a53be76..5d458c5f080e 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -76,11 +76,6 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
> > static bool ignore_rlimit_data;
> > core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
> >
> > -static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
> > - struct vm_area_struct *vma, struct vm_area_struct *prev,
> > - struct vm_area_struct *next, unsigned long start,
> > - unsigned long end, unsigned long tree_end, bool mm_wr_locked);
> > -
> > static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
> > {
> > return pgprot_modify(oldprot, vm_get_page_prot(vm_flags));
> > @@ -2398,22 +2393,21 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
> > *
> > * Called with the mm semaphore held.
> > */
> > -static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
> > - struct vm_area_struct *vma, struct vm_area_struct *prev,
> > - struct vm_area_struct *next, unsigned long start,
> > - unsigned long end, unsigned long tree_end, bool mm_wr_locked)
> > +static void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> > + struct vm_area_struct *prev, struct vm_area_struct *next)
> > {
> > + struct mm_struct *mm = vma->vm_mm;
> > struct mmu_gather tlb;
> > - unsigned long mt_start = mas->index;
> >
> > lru_add_drain();
> > tlb_gather_mmu(&tlb, mm);
> > update_hiwater_rss(mm);
> > - unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
> > - mas_set(mas, mt_start);
>
> OK so the 'raise something in one patch only for it to be removed in the
> next' pattern continues :) more aesthetically pleasing.
>
> > + unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
> > + /* mm_wr_locked = */ true);
> > + mas_set(mas, vma->vm_end);
> > free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
> > - next ? next->vm_start : USER_PGTABLES_CEILING,
> > - mm_wr_locked);
> > + next ? next->vm_start : USER_PGTABLES_CEILING,
> > + /* mm_wr_locked = */ true);
> > tlb_finish_mmu(&tlb);
> > }
> >
> > @@ -3186,8 +3180,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >
> > vma_iter_set(&vmi, vma->vm_end);
> > /* Undo any partial mapping done by a device driver. */
> > - unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
> > - vma->vm_end, vma->vm_end, true);
> > + unmap_region(&vmi.mas, vma, prev, next);
> > }
> > if (writable_file_mapping)
> > mapping_unmap_writable(file->f_mapping);
> > --
> > 2.43.0
> >
>
> Lovely reduction in parameters.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
* Re: [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region()
2024-07-10 16:48 ` Suren Baghdasaryan
@ 2024-07-10 17:18 ` Liam R. Howlett
0 siblings, 0 replies; 78+ messages in thread
From: Liam R. Howlett @ 2024-07-10 17:18 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: linux-mm, Andrew Morton, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
* Suren Baghdasaryan <surenb@google.com> [240710 12:48]:
> On Thu, Jul 4, 2024 at 11:27 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Instead of moving (or leaving) the vma iterator pointing at the previous
> > vma, leave it pointing at the insert location. Pointing the vma
> > iterator at the insert location allows for a cleaner walk of the vma
> > tree for MAP_FIXED and the no expansion cases.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 28 ++++++++++++++--------------
> > 1 file changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f5b33de4e717..ecf55d32e804 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2963,11 +2963,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vms_complete_munmap_vmas(&vms, &mas_detach);
> > next = vms.next;
> > prev = vms.prev;
> > - vma_prev(&vmi);
> > vma = NULL;
> > } else {
> > next = vma_next(&vmi);
> > prev = vma_prev(&vmi);
> > + if (prev)
> > + vma_iter_next_range(&vmi);
> > }
> >
> > /*
> > @@ -2980,11 +2981,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vm_flags |= VM_ACCOUNT;
> > }
> >
> > - if (vm_flags & VM_SPECIAL) {
> > - if (prev)
> > - vma_iter_next_range(&vmi);
> > + if (vm_flags & VM_SPECIAL)
> > goto cannot_expand;
> > - }
> >
> > /* Attempt to expand an old mapping */
> > /* Check next */
> > @@ -3005,19 +3003,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > merge_start = prev->vm_start;
> > vma = prev;
> > vm_pgoff = prev->vm_pgoff;
> > - } else if (prev) {
> > - vma_iter_next_range(&vmi);
> > + vma_prev(&vmi);
> > }
> >
> > - /* Actually expand, if possible */
> > - if (vma &&
> > - !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> > - khugepaged_enter_vma(vma, vm_flags);
> > - goto expanded;
> > + if (vma) {
> > + /* Actually expand, if possible */
> > + if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
> > + khugepaged_enter_vma(vma, vm_flags);
> > + goto expanded;
> > + }
> > +
> > + /* If the expand fails, then reposition the vma iterator */
> > + if (unlikely(vma == prev))
> > + vma_iter_set(&vmi, addr);
> > }
> >
> > - if (vma == prev)
> > - vma_iter_set(&vmi, addr);
>
> Before this change we would reposition vmi if vma == prev == NULL.
> After this change we don't do that. Is this situation possible and if
> so, will vmi be correct?
vma == NULL can happen if it is a MAP_FIXED or we cannot expand (no
prev/next that match the checks for expanding). If there is no prev,
then we can have a case where vma == prev == NULL. So the situation can
be reached.
In the MAP_FIXED situation, the vms_complete_munmap_vmas() call will
return with the iterator pointing at the range of addr, so this is fine.
In the other case, we cannot move the vma iterator to prev since there
isn't one and the vma iterator is left pointing at the first range in
the tree (0 to .. whatever) that includes the addr.
So yes it is possible and yes it is correctly positioned.
Thanks,
Liam
* Re: [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-07-10 16:07 ` Suren Baghdasaryan
@ 2024-07-10 17:29 ` Suren Baghdasaryan
0 siblings, 0 replies; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 17:29 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, linux-mm, Andrew Morton,
Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook
On Wed, Jul 10, 2024 at 9:07 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, Jul 5, 2024 at 12:59 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240705 15:27]:
> > > On Thu, Jul 04, 2024 at 02:27:08PM GMT, Liam R. Howlett wrote:
> > > > Clean up the code by changing the munmap operation to use a structure
> > > > for the accounting and munmap variables.
> > > >
> > > > Since remove_mt() is only called in one location and the contents will
> > > > be reduce to almost nothing. The remains of the function can be added
nit: s/reduce/reduced
> > > > to vms_complete_munmap_vmas().
> > > >
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > ---
> > > > mm/internal.h | 6 ++++
> > > > mm/mmap.c | 81 ++++++++++++++++++++++++++-------------------------
> > > > 2 files changed, 47 insertions(+), 40 deletions(-)
> > > >
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index f1e6dea2efcf..8cbbbe7d40f3 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -1488,12 +1488,18 @@ struct vma_munmap_struct {
> > > > struct vma_iterator *vmi;
> > > > struct mm_struct *mm;
> > > > struct vm_area_struct *vma; /* The first vma to munmap */
> > > > + struct vm_area_struct *next; /* vma after the munmap area */
> > > > + struct vm_area_struct *prev; /* vma before the munmap area */
> > >
> > > I mean this is about as pedantic as it gets, and, admittedly an annoying
> > > comment to make, but the ordering... can't we at least put prev before
> > > next? ;)
> >
> > I can do that, no problem.
> > ...
> >
> > >
> > > This is a big improvement overall, very fiddly code.
> > >
> > > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
> >
> > Thanks!
* Re: [PATCH v3 14/16] mm/mmap: Use PHYS_PFN in mmap_region()
2024-07-08 12:21 ` Lorenzo Stoakes
2024-07-09 18:35 ` Liam R. Howlett
@ 2024-07-10 17:32 ` Suren Baghdasaryan
1 sibling, 0 replies; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 17:32 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, linux-mm, Andrew Morton, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Mon, Jul 8, 2024 at 5:21 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jul 04, 2024 at 02:27:16PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Instead of shifting the length by PAGE_SIZE, use PHYS_PFN. Also use the
> > existing local variable everywhere instead of some of the time.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > ---
> > mm/mmap.c | 10 +++++-----
> > 1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 0c334eeae8cd..b14da6bd257f 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2935,7 +2935,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > struct mm_struct *mm = current->mm;
> > struct vm_area_struct *vma = NULL;
> > struct vm_area_struct *next, *prev, *merge;
> > - pgoff_t pglen = len >> PAGE_SHIFT;
> > + pgoff_t pglen = PHYS_PFN(len);
> > unsigned long charged = 0;
> > struct vma_munmap_struct vms;
> > struct ma_state mas_detach;
> > @@ -2955,7 +2955,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > * MAP_FIXED may remove pages of mappings that intersects with requested
> > * mapping. Account for the pages it would unmap.
> > */
> > - if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> > + if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > return -ENOMEM;
> >
> > if (unlikely(!can_modify_mm(mm, addr, end)))
> > @@ -2990,7 +2990,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > * Private writable mapping: check memory availability
> > */
> > if (accountable_mapping(file, vm_flags)) {
> > - charged = len >> PAGE_SHIFT;
> > + charged = pglen;
> > charged -= nr_accounted;
> > if (security_vm_enough_memory_mm(mm, charged))
> > goto abort_munmap;
> > @@ -3149,14 +3149,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > if (vms.nr_pages)
> > vms_complete_munmap_vmas(&vms, &mas_detach);
> >
> > - vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > + vm_stat_account(mm, vm_flags, pglen);
> > if (vm_flags & VM_LOCKED) {
> > if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > is_vm_hugetlb_page(vma) ||
> > vma == get_gate_vma(current->mm))
> > vm_flags_clear(vma, VM_LOCKED_MASK);
> > else
> > - mm->locked_vm += (len >> PAGE_SHIFT);
> > + mm->locked_vm += pglen;
> > }
> >
> > if (file)
> > --
> > 2.43.0
> >
>
> Maybe I should literally look ahead before making comments :)) thanks for
> reading my mind and doing what I asked though! ;)
>
> However I don't think you've fixed the duplication of PHYS_PFN(vm_end -
> vm_start) in count_vma_pages_range() - still worth doing I think.
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
* Re: [PATCH v3 15/16] mm/mmap: Use vms accounted pages in mmap_region()
2024-07-08 12:43 ` Lorenzo Stoakes
@ 2024-07-10 17:43 ` Suren Baghdasaryan
0 siblings, 0 replies; 78+ messages in thread
From: Suren Baghdasaryan @ 2024-07-10 17:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, linux-mm, Andrew Morton, Vlastimil Babka,
Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook
On Mon, Jul 8, 2024 at 5:43 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Jul 04, 2024 at 02:27:17PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Change from nr_pages variable to vms.nr_accounted for the charged pages
> > calculation. This is necessary for a future patch.
> >
> > This also avoids checking security_vm_enough_memory_mm() if the amount
> > of memory won't change.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Cc: Kees Cook <kees@kernel.org>
> > ---
> > mm/mmap.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index b14da6bd257f..b2de26683903 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2980,6 +2980,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > } else {
> > /* Minimal setup of vms */
> > vms.nr_pages = 0;
> > + vms.nr_accounted = 0;
>
> This kind of highlights my concern about only setting some vms fields, now we
> have to remember to change this in the right place or happen to know that
> init_vma_munmap() will be otherwise invoked.
>
> > next = vma_next(&vmi);
> > prev = vma_prev(&vmi);
> > if (prev)
> > @@ -2991,9 +2992,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > */
> > if (accountable_mapping(file, vm_flags)) {
> > charged = pglen;
> > - charged -= nr_accounted;
> > - if (security_vm_enough_memory_mm(mm, charged))
> > + charged -= vms.nr_accounted;
> > + if (charged && security_vm_enough_memory_mm(mm, charged))
> > goto abort_munmap;
> > +
> > vms.nr_accounted = 0;
>
> Is setting this to zero really needed here? We may be done with this, but if the
> vms value represents the 'unmap state' of this range, surely the number of
> accountable pages remains the same?
>
> > vm_flags |= VM_ACCOUNT;
> > }
> > --
> > 2.43.0
> >
>
> At this point nr_accounted is no longer used, but I'm guessing a follow up patch
> will remove this? :)
IMHO this and the next patch can be combined to remove this confusion.
They are both rather small, so would not be a big deal.
>
> I was wondering why you used that given the gather function also separately
> calculates it, but I guess this answers that!
>
> Generally this looks good to me, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-10 16:09 ` Liam R. Howlett
@ 2024-07-10 19:27 ` Dmitry Safonov
2024-07-10 21:04 ` LEROY Christophe
1 sibling, 0 replies; 78+ messages in thread
From: Dmitry Safonov @ 2024-07-10 19:27 UTC (permalink / raw)
To: Liam R. Howlett, LEROY Christophe, Lorenzo Stoakes,
Michael Ellerman, linux-mm, Andrew Morton, Suren Baghdasaryan,
Vlastimil Babka, Matthew Wilcox, sidhartha.kumar,
Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
Kees Cook, linuxppc-dev, Dmitry Safonov
Hi Liam,
On Wed, Jul 10, 2024 at 5:09 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 08:59]:
> >
> ...
> > >>>>
> > >>>> Assuming the removal of the vdso does not cause the application to seg
> > >>>> fault, then the user visible change is that any vdso call after a failed
> > >>>> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
> > >>>> would fail is if the mapping process was attempting to map a large
> > >>>> enough area over the vdso (which is accounted and in the vma tree,
> > >>>> afaict) and ran out of memory. Note that this situation could arise
> > >>>> already since we could run out of memory (not accounting) after the
> > >>>> arch_unmap() call within the kernel.
> > >>>>
> > >>>> The code today can suffer the same fate, but not by the accounting
> > >>>> failure. It can happen due to failure to allocate a new vma,
> > >>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> > >>>> failure scenarios later in the mmap_region() function.
> > >>>>
> > >>>> At the very least, this requires an expanded change log.
> > >>>
> ...
>
> > >>> I mean why are they unmapping the VDSO, why is that valid, why does it need
> > >>> that field to be set to NULL, is it possible to signify that in some other
> > >>> way etc.?
> > >>
> > >> It was originally for CRIU. So a niche workload on a niche architecture.
> > >>
> > >> But from the commit that added it, it sounds like CRIU was using mremap,
> > >> which should be handled these days by vdso_mremap(). So it could be that
> > >> arch_unmap() is not actually needed for CRIU anymore.
> > >
> > > Oh that's interesting!
> > >
> > >>
> > >> Then I guess we have to decide if removing our arch_unmap() would be an
> > >> ABI break, regardless of whether CRIU needs it or not.
> > >
> > > Seems to me like an internal implementation detail that should hopefully
> > > not result in anything that should have visible ABI impact?
> > >
> > > I guess this is something we ought to assess. It would be useful to
> > > eliminate hooks where we can so we can better control VMA behaviour without
> > > having to worry about an arch being able to do arbitrary things at
> > > unexpected times, especially pertinent where we change the order of things.
> > >
> >
> > I see you are talking about arch_unmap(). I didn't follow the entire
> > discussion but we have some related stuff here:
> > https://github.com/linuxppc/issues/issues/241
> >
> > If I remember correctly arch_unmap() should have gone away we Dmitry's
> > series
> > https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/#r
> > but it hasn't been applied yet.
> >
>
> That is good news!
>
> To review, ppc is the only arch using this now and it sounds like you
> want to remove it too.
>
> Considering the age of that thread and the possibility of conflict with
> my series, can I drop the entire arch_unmap() function instead of
> modifying the handling in core mm? I'm going to assume that's okay and
> start working on this for v4 (because there hasn't been a public reply
> for v4 since 2023/10/11).
Yeah, this kind of fell through the cracks. I meant to find time to
push v4, but on the job side I got pulled into core networking changes
that customers required, and on the other side I got a bit demotivated
by the slight pushback on v3 with "justify why is it needed at all?"
for changes that are mostly cleanups and refactoring.
So, while I usually don't give up on patch sets that still make sense to
me, priorities and motivation changed and the set moved lower on my todo
list.
If you have the time and urge to finish this patch set, you are more
than welcome to adopt it :-)
Otherwise, I'll try to find time for it, but not in the near term, as at
the moment I'm cooking tcp & selftests changes that I'd love to see
upstream.
> This would mean less arch-specific hooks and that's always a good idea.
Thanks,
Dmitry
* Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
2024-07-10 16:09 ` Liam R. Howlett
2024-07-10 19:27 ` Dmitry Safonov
@ 2024-07-10 21:04 ` LEROY Christophe
1 sibling, 0 replies; 78+ messages in thread
From: LEROY Christophe @ 2024-07-10 21:04 UTC (permalink / raw)
To: Liam R. Howlett, Lorenzo Stoakes, Michael Ellerman, linux-mm,
Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook, linuxppc-dev,
Dmitry Safonov
Le 10/07/2024 à 18:09, Liam R. Howlett a écrit :
> * LEROY Christophe <christophe.leroy2@cs-soprasteria.com> [240710 08:59]:
>>
> ...
>>>>>>
>>>>>> Assuming the removal of the vdso does not cause the application to seg
>>>>>> fault, then the user visible change is that any vdso call after a failed
>>>>>> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
>>>>>> would fail is if the mapping process was attempting to map a large
>>>>>> enough area over the vdso (which is accounted and in the vma tree,
>>>>>> afaict) and ran out of memory. Note that this situation could arise
>>>>>> already since we could run out of memory (not accounting) after the
>>>>>> arch_unmap() call within the kernel.
>>>>>>
>>>>>> The code today can suffer the same fate, but not by the accounting
>>>>>> failure. It can happen due to failure to allocate a new vma,
>>>>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
>>>>>> failure scenarios later in the mmap_region() function.
>>>>>>
>>>>>> At the very least, this requires an expanded change log.
>>>>>
> ...
>
>>>>> I mean why are they unmapping the VDSO, why is that valid, why does it need
>>>>> that field to be set to NULL, is it possible to signify that in some other
>>>>> way etc.?
>>>>
>>>> It was originally for CRIU. So a niche workload on a niche architecture.
>>>>
>>>> But from the commit that added it, it sounds like CRIU was using mremap,
>>>> which should be handled these days by vdso_mremap(). So it could be that
>>>> arch_unmap() is not actually needed for CRIU anymore.
>>>
>>> Oh that's interesting!
>>>
>>>>
>>>> Then I guess we have to decide if removing our arch_unmap() would be an
>>>> ABI break, regardless of whether CRIU needs it or not.
>>>
>>> Seems to me like an internal implementation detail that should hopefully
>>> not result in anything that should have visible ABI impact?
>>>
>>> I guess this is something we ought to assess. It would be useful to
>>> eliminate hooks where we can so we can better control VMA behaviour without
>>> having to worry about an arch being able to do arbitrary things at
>>> unexpected times, especially pertinent where we change the order of things.
>>>
>>
>> I see you are talking about arch_unmap(). I didn't follow the entire
>> discussion but we have some related stuff here:
>> https://github.com/linuxppc/issues/issues/241
>>
>> If I remember correctly arch_unmap() should have gone away with Dmitry's
>> series
>> https://lore.kernel.org/lkml/20210611180242.711399-1-dima@arista.com/#r
>> but it hasn't been applied yet.
>>
>
> That is good news!
>
> To review, ppc is the only arch using this now and it sounds like you
> want to remove it too.
Yes, we want to remove it, but it needs to be replaced by a more generic
core-based equivalent.
>
> Considering the age of that thread and the possibility of conflict with
> my series, can I drop the entire arch_unmap() function instead of
> modifying the handling in core mm? I'm going to assume that's okay and
> start working on this for v4 (because there hasn't been a public reply
> for v4 since 2023/10/11).
Drop it, yes, but not without implementing a replacement in core mm like
the one proposed by Dmitry.
>
> This would mean less arch-specific hooks and that's always a good idea.
>
Indeed.
Christophe
end of thread (last message: 2024-07-10 21:04 UTC)
Thread overview: 78+ messages
2024-07-04 18:27 [PATCH v3 00/16] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 01/16] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 02/16] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
2024-07-05 17:02 ` Lorenzo Stoakes
2024-07-05 18:12 ` Liam R. Howlett
2024-07-10 16:06 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 03/16] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
2024-07-05 17:39 ` Lorenzo Stoakes
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 04/16] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
2024-07-05 18:01 ` Lorenzo Stoakes
2024-07-05 18:41 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 16:32 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 05/16] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
2024-07-05 18:39 ` Lorenzo Stoakes
2024-07-05 19:09 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 16:30 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 06/16] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
2024-07-05 19:27 ` Lorenzo Stoakes
2024-07-05 19:59 ` Liam R. Howlett
2024-07-10 16:07 ` Suren Baghdasaryan
2024-07-10 17:29 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 07/16] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
2024-07-05 19:35 ` Lorenzo Stoakes
2024-07-10 16:06 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 08/16] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
2024-07-05 19:39 ` Lorenzo Stoakes
2024-07-05 20:00 ` Liam R. Howlett
2024-07-10 16:15 ` Suren Baghdasaryan
2024-07-10 16:35 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 09/16] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
2024-07-05 20:06 ` Lorenzo Stoakes
2024-07-05 20:30 ` Liam R. Howlett
2024-07-05 20:36 ` Lorenzo Stoakes
2024-07-08 14:49 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 10/16] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
2024-07-05 20:18 ` Lorenzo Stoakes
2024-07-05 20:56 ` Liam R. Howlett
2024-07-08 11:08 ` Lorenzo Stoakes
2024-07-08 16:43 ` Liam R. Howlett
2024-07-10 16:48 ` Suren Baghdasaryan
2024-07-10 17:18 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
2024-07-05 20:27 ` Lorenzo Stoakes
2024-07-08 14:45 ` Liam R. Howlett
2024-07-10 17:14 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 12/16] mm/mmap: Clean up unmap_region() argument list Liam R. Howlett
2024-07-05 20:33 ` Lorenzo Stoakes
2024-07-10 17:14 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 13/16] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
2024-07-08 12:18 ` Lorenzo Stoakes
2024-07-08 19:10 ` Liam R. Howlett
2024-07-09 14:27 ` Lorenzo Stoakes
2024-07-09 18:43 ` Liam R. Howlett
2024-07-04 18:27 ` [PATCH v3 14/16] mm/mmap: Use PHYS_PFN " Liam R. Howlett
2024-07-08 12:21 ` Lorenzo Stoakes
2024-07-09 18:35 ` Liam R. Howlett
2024-07-09 18:42 ` Lorenzo Stoakes
2024-07-10 17:32 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 15/16] mm/mmap: Use vms accounted pages " Liam R. Howlett
2024-07-08 12:43 ` Lorenzo Stoakes
2024-07-10 17:43 ` Suren Baghdasaryan
2024-07-04 18:27 ` [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check " Liam R. Howlett
2024-07-08 12:52 ` Lorenzo Stoakes
2024-07-08 20:43 ` Liam R. Howlett
2024-07-09 14:42 ` Liam R. Howlett
2024-07-09 14:51 ` Lorenzo Stoakes
2024-07-09 14:52 ` Liam R. Howlett
2024-07-09 18:13 ` Dave Hansen
2024-07-09 14:45 ` Lorenzo Stoakes
2024-07-10 12:28 ` Michael Ellerman
2024-07-10 12:45 ` Lorenzo Stoakes
2024-07-10 12:59 ` LEROY Christophe
2024-07-10 16:09 ` Liam R. Howlett
2024-07-10 19:27 ` Dmitry Safonov
2024-07-10 21:04 ` LEROY Christophe