linux-mm.kvack.org archive mirror
* [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
@ 2024-06-25 19:11 Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 01/15] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
                   ` (15 more replies)
  0 siblings, 16 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

It is now possible to walk the vma tree using the rcu read lock, and it
is beneficial to do so to reduce lock contention.  Doing so while a
MAP_FIXED mapping is executing means that a reader may see a gap in the
vma tree that should never logically exist - and does not when using the
mmap lock in read mode.  The temporal gap exists because mmap_region()
calls munmap() prior to installing the new mapping.

This patch set stops rcu readers from seeing the temporal gap by
splitting up the munmap() function into two parts.  The first part
prepares the vma tree for modifications by doing the necessary splits
and tracks the vmas marked for removal in a side tree.  The second part
completes the munmapping of the vmas after the vma tree has been
overwritten (either by a MAP_FIXED replacement vma or by a NULL in the
munmap() case).

Please note that rcu walkers will still be able to see a temporary
state of split vmas that may be in the process of being removed, but the
temporal gap will not be exposed.  vma_start_write() is called on both
parts of the split vma, so this state is detectable.
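
As a rough sketch (simplified control flow; the helper names are
introduced by the patches below), the reworked path looks like:

	error = vms_gather_munmap_vmas(&vms, &mas_detach);
	if (error)
		return error;
	/* ... overwrite the range in the vma tree (MAP_FIXED), or clear it ... */
	vms_complete_munmap_vmas(&vms, &mas_detach);	/* pgtables, stats, frees */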

RFC: https://lore.kernel.org/linux-mm/20240531163217.1584450-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20240611180200.711239-1-Liam.Howlett@oracle.com/

Changes since v1:
- Fixed an issue with expanding vmas clearing the incorrect PTE range;
  thanks to Bert Karwatzki and Jiri Olsa for testing linux-next
- Separated patches into smaller portions for easier reviewing
- Replaced open-coded shifting of the length with PHYS_PFN()
- Changed the security calculation (Cc Kees on this change)
- Changed the vma iterator use to point to the targeted area instead of
  the previous vma
- Removed page table entries prior to fully removing the vmas when a
  driver may create mappings
- Tested with stress-ng and with splitting/joining a vma at its start,
  end, and middle

Liam R. Howlett (15):
  mm/mmap: Correctly position vma_iterator in __split_vma()
  mm/mmap: Introduce abort_munmap_vmas()
  mm/mmap: Introduce vmi_complete_munmap_vmas()
  mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
  mm/mmap: Introduce vma_munmap_struct for use in munmap operations
  mm/mmap: Change munmap to use vma_munmap_struct() for accounting and
    surrounding vmas
  mm/mmap: Extract validate_mm() from vma_complete()
  mm/mmap: Inline munmap operation in mmap_region()
  mm/mmap: Expand mmap_region() munmap call
  mm/mmap: Reposition vma iterator in mmap_region()
  mm/mmap: Track start and end of munmap in vma_munmap_struct
  mm/mmap: Avoid zeroing vma tree in mmap_region()
  mm/mmap: Use PHYS_PFN in mmap_region()
  mm/mmap: Use vms accounted pages in mmap_region()
  mm/mmap: Move may_expand_vm() check in mmap_region()

 mm/internal.h |  25 +++
 mm/mmap.c     | 449 +++++++++++++++++++++++++++++++-------------------
 2 files changed, 304 insertions(+), 170 deletions(-)

-- 
2.43.0




* [PATCH v2 01/15] mm/mmap: Correctly position vma_iterator in __split_vma()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 02/15] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

The vma iterator may be left pointing to the newly created vma.  This
happens when inserting the new vma at the end of the old vma
(!new_below).

The incorrect position in the vma iterator is not exposed currently
since the vma iterator is repositioned in the munmap path and is not
reused in any of the other paths.

This has limited impact in the current code, but is required for future
changes.
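
A minimal sketch of the invariant this establishes, using a
hypothetical caller:

	if (!__split_vma(vmi, vma, addr, 0 /* !new_below */)) {
		/*
		 * On success the iterator now points at the original vma in
		 * both the new_below and !new_below cases, so a caller can
		 * keep walking without repositioning.
		 */
	}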

Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/mmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index e42d89f98071..28a46d9ddde0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2414,7 +2414,7 @@ static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
 /*
  * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
  * has already been checked or doesn't make sense to fail.
- * VMA Iterator will point to the end VMA.
+ * VMA Iterator will point to the original vma.
  */
 static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		       unsigned long addr, int new_below)
@@ -2483,6 +2483,9 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	/* Success. */
 	if (new_below)
 		vma_next(vmi);
+	else
+		vma_prev(vmi);
+
 	return 0;
 
 out_free_mpol:
-- 
2.43.0




* [PATCH v2 02/15] mm/mmap: Introduce abort_munmap_vmas()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 01/15] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 03/15] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

Extract the cleanup of failed munmap() operations from
do_vmi_align_munmap().  This simplifies later patches in the series.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 28a46d9ddde0..d572e1ff8255 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2586,6 +2586,25 @@ struct vm_area_struct *vma_merge_extend(struct vma_iterator *vmi,
 			 vma->vm_userfaultfd_ctx, anon_vma_name(vma));
 }
 
+/*
+ * abort_munmap_vmas - Undo any munmap work and free resources
+ *
+ * Reattach detached vmas and free up the maple tree used to track them.
+ */
+static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+{
+	struct vm_area_struct *vma;
+	int limit;
+
+	limit = mas_detach->index;
+	mas_set(mas_detach, 0);
+	/* Re-attach any detached VMAs */
+	mas_for_each(mas_detach, vma, limit)
+		vma_mark_detached(vma, false);
+
+	__mt_destroy(mas_detach->tree);
+}
+
 /*
  * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
  * @vmi: The vma iterator
@@ -2740,11 +2759,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 userfaultfd_error:
 munmap_gather_failed:
 end_split_failed:
-	mas_set(&mas_detach, 0);
-	mas_for_each(&mas_detach, next, end)
-		vma_mark_detached(next, false);
-
-	__mt_destroy(&mt_detach);
+	abort_munmap_vmas(&mas_detach);
 start_split_failed:
 map_count_exceeded:
 	validate_mm(mm);
-- 
2.43.0




* [PATCH v2 03/15] mm/mmap: Introduce vmi_complete_munmap_vmas()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 01/15] mm/mmap: Correctly position vma_iterator in __split_vma() Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 02/15] mm/mmap: Introduce abort_munmap_vmas() Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 04/15] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

Extract, from the munmap() operation, all of the work that needs to be
completed after the vma maple tree is updated.  Extracting this makes
the later patches in the series easier to understand.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 79 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 53 insertions(+), 26 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index d572e1ff8255..411798f46932 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2605,6 +2605,56 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
 	__mt_destroy(mas_detach->tree);
 }
 
+/*
+ * vmi_complete_munmap_vmas() - Finish the munmap() operation
+ * @vmi: The vma iterator
+ * @start: The start address
+ * @end: The end address
+ * @unlock: Unlock the mm or not
+ * @mas_detach: the maple state of the detached vma maple tree
+ * @locked_vm: The locked_vm count in the detached vmas
+ *
+ * This function updates the mm_struct, unmaps the region, frees the resources
+ * used for the munmap() and may downgrade the lock - if requested.  Everything
+ * needed to be done once the vma maple tree is updated.
+ */
+static void
+vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		struct mm_struct *mm, unsigned long start,
+		unsigned long end, bool unlock, struct ma_state *mas_detach,
+		unsigned long locked_vm)
+{
+	struct vm_area_struct *prev, *next;
+	int count;
+
+	count = mas_detach->index + 1;
+	mm->map_count -= count;
+	mm->locked_vm -= locked_vm;
+	if (unlock)
+		mmap_write_downgrade(mm);
+
+	prev = vma_iter_prev_range(vmi);
+	next = vma_next(vmi);
+	if (next)
+		vma_iter_prev_range(vmi);
+
+	/*
+	 * We can free page tables without write-locking mmap_lock because VMAs
+	 * were isolated before we downgraded mmap_lock.
+	 */
+	mas_set(mas_detach, 1);
+	unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
+		     !unlock);
+	/* Statistics and freeing VMAs */
+	mas_set(mas_detach, 0);
+	remove_mt(mm, mas_detach);
+	validate_mm(mm);
+	if (unlock)
+		mmap_read_unlock(mm);
+
+	__mt_destroy(mas_detach->tree);
+}
+
 /*
  * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
  * @vmi: The vma iterator
@@ -2624,7 +2674,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
 		    unsigned long end, struct list_head *uf, bool unlock)
 {
-	struct vm_area_struct *prev, *next = NULL;
+	struct vm_area_struct *next = NULL;
 	struct maple_tree mt_detach;
 	int count = 0;
 	int error = -ENOMEM;
@@ -2728,31 +2778,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto clear_tree_failed;
 
 	/* Point of no return */
-	mm->locked_vm -= locked_vm;
-	mm->map_count -= count;
-	if (unlock)
-		mmap_write_downgrade(mm);
-
-	prev = vma_iter_prev_range(vmi);
-	next = vma_next(vmi);
-	if (next)
-		vma_iter_prev_range(vmi);
-
-	/*
-	 * We can free page tables without write-locking mmap_lock because VMAs
-	 * were isolated before we downgraded mmap_lock.
-	 */
-	mas_set(&mas_detach, 1);
-	unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
-		     !unlock);
-	/* Statistics and freeing VMAs */
-	mas_set(&mas_detach, 0);
-	remove_mt(mm, &mas_detach);
-	validate_mm(mm);
-	if (unlock)
-		mmap_read_unlock(mm);
-
-	__mt_destroy(&mt_detach);
+	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
+				 locked_vm);
 	return 0;
 
 clear_tree_failed:
-- 
2.43.0




* [PATCH v2 04/15] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (2 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 03/15] mm/mmap: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 05/15] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
detached maple tree for removal later.  Part of the gathering is the
splitting of vmas that span the boundary.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 82 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 58 insertions(+), 24 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 411798f46932..8dc8ffbf9d8d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2656,32 +2656,29 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 }
 
 /*
- * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * for removal at a later date.  Handles splitting first and last if necessary
+ * and marking the vmas as isolated.
+ *
  * @vmi: The vma iterator
  * @vma: The starting vm_area_struct
  * @mm: The mm_struct
  * @start: The aligned start address to munmap.
  * @end: The aligned end address to munmap.
  * @uf: The userfaultfd list_head
- * @unlock: Set to true to drop the mmap_lock.  unlocking only happens on
- * success.
+ * @mas_detach: The maple state tracking the detached tree
  *
- * Return: 0 on success and drops the lock if so directed, error and leaves the
- * lock held otherwise.
+ * Return: 0 on success
  */
 static int
-do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
-		    unsigned long end, struct list_head *uf, bool unlock)
+		    unsigned long end, struct list_head *uf,
+		    struct ma_state *mas_detach, unsigned long *locked_vm)
 {
 	struct vm_area_struct *next = NULL;
-	struct maple_tree mt_detach;
 	int count = 0;
 	int error = -ENOMEM;
-	unsigned long locked_vm = 0;
-	MA_STATE(mas_detach, &mt_detach, 0, 0);
-	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
-	mt_on_stack(mt_detach);
 
 	/*
 	 * If we need to split any vma, do it now to save pain later.
@@ -2720,15 +2717,14 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 				goto end_split_failed;
 		}
 		vma_start_write(next);
-		mas_set(&mas_detach, count);
-		error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
+		mas_set(mas_detach, count++);
+		if (next->vm_flags & VM_LOCKED)
+			*locked_vm += vma_pages(next);
+
+		error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
 		if (error)
 			goto munmap_gather_failed;
 		vma_mark_detached(next, true);
-		if (next->vm_flags & VM_LOCKED)
-			locked_vm += vma_pages(next);
-
-		count++;
 		if (unlikely(uf)) {
 			/*
 			 * If userfaultfd_unmap_prep returns an error the vmas
@@ -2753,7 +2749,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
 	{
-		MA_STATE(test, &mt_detach, 0, 0);
+		MA_STATE(test, mas_detach->tree, 0, 0);
 		struct vm_area_struct *vma_mas, *vma_test;
 		int test_count = 0;
 
@@ -2773,6 +2769,48 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	while (vma_iter_addr(vmi) > start)
 		vma_iter_prev_range(vmi);
 
+	return 0;
+
+userfaultfd_error:
+munmap_gather_failed:
+end_split_failed:
+	abort_munmap_vmas(mas_detach);
+start_split_failed:
+map_count_exceeded:
+	return error;
+}
+
+/*
+ * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * @vmi: The vma iterator
+ * @vma: The starting vm_area_struct
+ * @mm: The mm_struct
+ * @start: The aligned start address to munmap.
+ * @end: The aligned end address to munmap.
+ * @uf: The userfaultfd list_head
+ * @unlock: Set to true to drop the mmap_lock.  unlocking only happens on
+ * success.
+ *
+ * Return: 0 on success and drops the lock if so directed, error and leaves the
+ * lock held otherwise.
+ */
+static int
+do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+		    struct mm_struct *mm, unsigned long start,
+		    unsigned long end, struct list_head *uf, bool unlock)
+{
+	struct maple_tree mt_detach;
+	MA_STATE(mas_detach, &mt_detach, 0, 0);
+	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+	mt_on_stack(mt_detach);
+	int error;
+	unsigned long locked_vm = 0;
+
+	error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
+				       &mas_detach, &locked_vm);
+	if (error)
+		goto gather_failed;
+
 	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
 	if (error)
 		goto clear_tree_failed;
@@ -2783,12 +2821,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	return 0;
 
 clear_tree_failed:
-userfaultfd_error:
-munmap_gather_failed:
-end_split_failed:
 	abort_munmap_vmas(&mas_detach);
-start_split_failed:
-map_count_exceeded:
+gather_failed:
 	validate_mm(mm);
 	return error;
 }
-- 
2.43.0




* [PATCH v2 05/15] mm/mmap: Introduce vma_munmap_struct for use in munmap operations
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (3 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 04/15] mm/mmap: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 06/15] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

Use a structure to pass along all the necessary information and counters
involved in removing vmas from the mm_struct.

Update vmi_ function names to vms_ to indicate the first argument
type change.
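
A condensed sketch of the intended usage, matching the
do_vmi_align_munmap() hunk below:

	struct vma_munmap_struct vms;

	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
	error = vms_gather_munmap_vmas(&vms, &mas_detach);
	if (error)
		goto gather_failed;
	/* ... update the vma tree ... */
	vms_complete_munmap_vmas(&vms, &mas_detach);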

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/internal.h |  16 ++++++
 mm/mmap.c     | 137 ++++++++++++++++++++++++++------------------------
 2 files changed, 88 insertions(+), 65 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b5e8484059c9..22563c959861 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1480,6 +1480,22 @@ struct vma_prepare {
 	struct vm_area_struct *remove2;
 };
 
+/*
+ * vma munmap operation
+ */
+struct vma_munmap_struct {
+	struct vma_iterator *vmi;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;	/* The first vma to munmap */
+	struct list_head *uf;		/* Userfaultfd list_head */
+	unsigned long start;		/* Aligned start addr */
+	unsigned long end;		/* Aligned end addr */
+	int vma_count;			/* Number of vmas that will be removed */
+	unsigned long nr_pages;		/* Number of pages being removed */
+	unsigned long locked_vm;	/* Number of locked pages */
+	bool unlock;			/* Unlock after the munmap */
+};
+
 void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 8dc8ffbf9d8d..76e93146ee9d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -500,6 +500,31 @@ static inline void init_vma_prep(struct vma_prepare *vp,
 	init_multi_vma_prep(vp, vma, NULL, NULL, NULL);
 }
 
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation.  Only unlocked on success
+ */
+static inline void init_vma_munmap(struct vma_munmap_struct *vms,
+		struct vma_iterator *vmi, struct vm_area_struct *vma,
+		unsigned long start, unsigned long end, struct list_head *uf,
+		bool unlock)
+{
+	vms->vmi = vmi;
+	vms->vma = vma;
+	vms->mm = vma->vm_mm;
+	vms->start = start;
+	vms->end = end;
+	vms->unlock = unlock;
+	vms->uf = uf;
+	vms->vma_count = 0;
+	vms->nr_pages = vms->locked_vm = 0;
+}
 
 /*
  * vma_prepare() - Helper function for handling locking VMAs prior to altering
@@ -2606,78 +2631,59 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
 }
 
 /*
- * vmi_complete_munmap_vmas() - Finish the munmap() operation
- * @vmi: The vma iterator
- * @start: The start address
- * @end: The end address
- * @unlock: Unlock the mm or not
- * @mas_detach: the maple state of the detached vma maple tree
- * @locked_vm: The locked_vm count in the detached vmas
- *
- * This function updates the mm_struct, unmaps the region, frees the resources
- * used for the munmap() and may downgrade the lock - if requested.  Everything
- * needed to be done once the vma maple tree is updated.
+ * vms_complete_munmap_vmas() - Finish the munmap() operation
+ * @vms: The vma munmap struct
+ * @mas_detach: The maple state of the detached vmas
  */
-static void
-vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		struct mm_struct *mm, unsigned long start,
-		unsigned long end, bool unlock, struct ma_state *mas_detach,
-		unsigned long locked_vm)
+
+static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *prev, *next;
-	int count;
+	struct mm_struct *mm;
 
-	count = mas_detach->index + 1;
-	mm->map_count -= count;
-	mm->locked_vm -= locked_vm;
-	if (unlock)
+	mm = vms->mm;
+	mm->map_count -= vms->vma_count;
+	mm->locked_vm -= vms->locked_vm;
+	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	prev = vma_iter_prev_range(vmi);
-	next = vma_next(vmi);
+	prev = vma_iter_prev_range(vms->vmi);
+	next = vma_next(vms->vmi);
 	if (next)
-		vma_iter_prev_range(vmi);
+		vma_iter_prev_range(vms->vmi);
 
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
 	 */
 	mas_set(mas_detach, 1);
-	unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
-		     !unlock);
+	unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
+		     vms->vma_count, !vms->unlock);
 	/* Statistics and freeing VMAs */
 	mas_set(mas_detach, 0);
 	remove_mt(mm, mas_detach);
 	validate_mm(mm);
-	if (unlock)
+	if (vms->unlock)
 		mmap_read_unlock(mm);
 
 	__mt_destroy(mas_detach->tree);
 }
 
 /*
- * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
  * for removal at a later date.  Handles splitting first and last if necessary
  * and marking the vmas as isolated.
  *
- * @vmi: The vma iterator
- * @vma: The starting vm_area_struct
- * @mm: The mm_struct
- * @start: The aligned start address to munmap.
- * @end: The aligned end address to munmap.
- * @uf: The userfaultfd list_head
+ * @vms: The vma munmap struct
  * @mas_detach: The maple state tracking the detached tree
  *
  * Return: 0 on success
  */
-static int
-vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
-		    struct mm_struct *mm, unsigned long start,
-		    unsigned long end, struct list_head *uf,
-		    struct ma_state *mas_detach, unsigned long *locked_vm)
+static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+		struct ma_state *mas_detach)
 {
 	struct vm_area_struct *next = NULL;
-	int count = 0;
 	int error = -ENOMEM;
 
 	/*
@@ -2689,17 +2695,18 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 */
 
 	/* Does it split the first one? */
-	if (start > vma->vm_start) {
+	if (vms->start > vms->vma->vm_start) {
 
 		/*
 		 * Make sure that map_count on return from munmap() will
 		 * not exceed its limit; but let map_count go just above
 		 * its limit temporarily, to help free resources as expected.
 		 */
-		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+		if (vms->end < vms->vma->vm_end &&
+		    vms->mm->map_count >= sysctl_max_map_count)
 			goto map_count_exceeded;
 
-		error = __split_vma(vmi, vma, start, 1);
+		error = __split_vma(vms->vmi, vms->vma, vms->start, 1);
 		if (error)
 			goto start_split_failed;
 	}
@@ -2708,24 +2715,24 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
 	 * it is always overwritten.
 	 */
-	next = vma;
+	next = vms->vma;
 	do {
 		/* Does it split the end? */
-		if (next->vm_end > end) {
-			error = __split_vma(vmi, next, end, 0);
+		if (next->vm_end > vms->end) {
+			error = __split_vma(vms->vmi, next, vms->end, 0);
 			if (error)
 				goto end_split_failed;
 		}
 		vma_start_write(next);
-		mas_set(mas_detach, count++);
+		mas_set(mas_detach, vms->vma_count++);
 		if (next->vm_flags & VM_LOCKED)
-			*locked_vm += vma_pages(next);
+			vms->locked_vm += vma_pages(next);
 
 		error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
 		if (error)
 			goto munmap_gather_failed;
 		vma_mark_detached(next, true);
-		if (unlikely(uf)) {
+		if (unlikely(vms->uf)) {
 			/*
 			 * If userfaultfd_unmap_prep returns an error the vmas
 			 * will remain split, but userland will get a
@@ -2735,16 +2742,17 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 			 * split, despite we could. This is unlikely enough
 			 * failure that it's not worth optimizing it for.
 			 */
-			error = userfaultfd_unmap_prep(next, start, end, uf);
+			error = userfaultfd_unmap_prep(next, vms->start,
+						       vms->end, vms->uf);
 
 			if (error)
 				goto userfaultfd_error;
 		}
 #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
-		BUG_ON(next->vm_start < start);
-		BUG_ON(next->vm_start > end);
+		BUG_ON(next->vm_start < vms->start);
+		BUG_ON(next->vm_start > vms->end);
 #endif
-	} for_each_vma_range(*vmi, next, end);
+	} for_each_vma_range(*(vms->vmi), next, vms->end);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
@@ -2753,21 +2761,21 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		struct vm_area_struct *vma_mas, *vma_test;
 		int test_count = 0;
 
-		vma_iter_set(vmi, start);
+		vma_iter_set(vms->vmi, vms->start);
 		rcu_read_lock();
-		vma_test = mas_find(&test, count - 1);
-		for_each_vma_range(*vmi, vma_mas, end) {
+		vma_test = mas_find(&test, vms->vma_count - 1);
+		for_each_vma_range(*(vms->vmi), vma_mas, vms->end) {
 			BUG_ON(vma_mas != vma_test);
 			test_count++;
-			vma_test = mas_next(&test, count - 1);
+			vma_test = mas_next(&test, vms->vma_count - 1);
 		}
 		rcu_read_unlock();
-		BUG_ON(count != test_count);
+		BUG_ON(vms->vma_count != test_count);
 	}
 #endif
 
-	while (vma_iter_addr(vmi) > start)
-		vma_iter_prev_range(vmi);
+	while (vma_iter_addr(vms->vmi) > vms->start)
+		vma_iter_prev_range(vms->vmi);
 
 	return 0;
 
@@ -2803,11 +2811,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	MA_STATE(mas_detach, &mt_detach, 0, 0);
 	mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
 	mt_on_stack(mt_detach);
+	struct vma_munmap_struct vms;
 	int error;
-	unsigned long locked_vm = 0;
 
-	error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
-				       &mas_detach, &locked_vm);
+	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
+	error = vms_gather_munmap_vmas(&vms, &mas_detach);
 	if (error)
 		goto gather_failed;
 
@@ -2816,8 +2824,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		goto clear_tree_failed;
 
 	/* Point of no return */
-	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
-				 locked_vm);
+	vms_complete_munmap_vmas(&vms, &mas_detach);
 	return 0;
 
 clear_tree_failed:
-- 
2.43.0




* [PATCH v2 06/15] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (4 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 05/15] mm/mmap: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 07/15] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

Clean up the code by changing the munmap operation to use a structure
for the accounting and munmap variables.

Since remove_mt() is only called in one location and its contents will
be reduced to almost nothing, the remains of the function can be added
to vms_complete_munmap_vmas().

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/internal.h |  6 ++++
 mm/mmap.c     | 81 ++++++++++++++++++++++++++-------------------------
 2 files changed, 47 insertions(+), 40 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 22563c959861..90cab15c3b81 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1487,12 +1487,18 @@ struct vma_munmap_struct {
 	struct vma_iterator *vmi;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;	/* The first vma to munmap */
+	struct vm_area_struct *next;	/* vma after the munmap area */
+	struct vm_area_struct *prev;    /* vma before the munmap area */
 	struct list_head *uf;		/* Userfaultfd list_head */
 	unsigned long start;		/* Aligned start addr */
 	unsigned long end;		/* Aligned end addr */
 	int vma_count;			/* Number of vmas that will be removed */
 	unsigned long nr_pages;		/* Number of pages being removed */
 	unsigned long locked_vm;	/* Number of locked pages */
+	unsigned long nr_accounted;	/* Number of VM_ACCOUNT pages */
+	unsigned long exec_vm;		/* Number of exec pages */
+	unsigned long stack_vm;	/* Number of stack pages */
+	unsigned long data_vm;		/* Number of data pages */
 	bool unlock;			/* Unlock after the munmap */
 };
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 76e93146ee9d..2a1a49f98fa3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -523,7 +523,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->unlock = unlock;
 	vms->uf = uf;
 	vms->vma_count = 0;
-	vms->nr_pages = vms->locked_vm = 0;
+	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
 }
 
 /*
@@ -2388,30 +2389,6 @@ struct vm_area_struct *expand_stack(struct mm_struct *mm, unsigned long addr)
 	return vma;
 }
 
-/*
- * Ok - we have the memory areas we should free on a maple tree so release them,
- * and do the vma updates.
- *
- * Called with the mm semaphore held.
- */
-static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
-{
-	unsigned long nr_accounted = 0;
-	struct vm_area_struct *vma;
-
-	/* Update high watermark before we lower total_vm */
-	update_hiwater_vm(mm);
-	mas_for_each(mas, vma, ULONG_MAX) {
-		long nrpages = vma_pages(vma);
-
-		if (vma->vm_flags & VM_ACCOUNT)
-			nr_accounted += nrpages;
-		vm_stat_account(mm, vma->vm_flags, -nrpages);
-		remove_vma(vma, false);
-	}
-	vm_unacct_memory(nr_accounted);
-}
-
 /*
  * Get rid of page table information in the indicated region.
  *
@@ -2634,12 +2611,15 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
  * vms_complete_munmap_vmas() - Finish the munmap() operation
  * @vms: The vma munmap struct
  * @mas_detach: The maple state of the detached vmas
+ *
+ * This function updates the mm_struct, unmaps the region, frees the resources
+ * used for the munmap() and may downgrade the lock - if requested.  Everything
+ * needed to be done once the vma maple tree is updated.
  */
-
 static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 		struct ma_state *mas_detach)
 {
-	struct vm_area_struct *prev, *next;
+	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 
 	mm = vms->mm;
@@ -2648,21 +2628,26 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
-	prev = vma_iter_prev_range(vms->vmi);
-	next = vma_next(vms->vmi);
-	if (next)
-		vma_iter_prev_range(vms->vmi);
-
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
 	 */
 	mas_set(mas_detach, 1);
-	unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
-		     vms->vma_count, !vms->unlock);
-	/* Statistics and freeing VMAs */
+	unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
+		     vms->start, vms->end, vms->vma_count, !vms->unlock);
+	/* Update high watermark before we lower total_vm */
+	update_hiwater_vm(mm);
+	/* Stat accounting */
+	WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
+	mm->exec_vm -= vms->exec_vm;
+	mm->stack_vm -= vms->stack_vm;
+	mm->data_vm -= vms->data_vm;
+	/* Remove and clean up vmas */
 	mas_set(mas_detach, 0);
-	remove_mt(mm, mas_detach);
+	mas_for_each(mas_detach, vma, ULONG_MAX)
+		remove_vma(vma, false);
+
+	vm_unacct_memory(vms->nr_accounted);
 	validate_mm(mm);
 	if (vms->unlock)
 		mmap_read_unlock(mm);
@@ -2710,13 +2695,14 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		if (error)
 			goto start_split_failed;
 	}
+	vms->prev = vma_prev(vms->vmi);
 
 	/*
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
 	 * it is always overwritten.
 	 */
-	next = vms->vma;
-	do {
+	for_each_vma_range(*(vms->vmi), next, vms->end) {
+		long nrpages;
 		/* Does it split the end? */
 		if (next->vm_end > vms->end) {
 			error = __split_vma(vms->vmi, next, vms->end, 0);
@@ -2725,8 +2711,21 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		}
 		vma_start_write(next);
 		mas_set(mas_detach, vms->vma_count++);
+		nrpages = vma_pages(next);
+
+		vms->nr_pages += nrpages;
 		if (next->vm_flags & VM_LOCKED)
-			vms->locked_vm += vma_pages(next);
+			vms->locked_vm += nrpages;
+
+		if (next->vm_flags & VM_ACCOUNT)
+			vms->nr_accounted += nrpages;
+
+		if (is_exec_mapping(next->vm_flags))
+			vms->exec_vm += nrpages;
+		else if (is_stack_mapping(next->vm_flags))
+			vms->stack_vm += nrpages;
+		else if (is_data_mapping(next->vm_flags))
+			vms->data_vm += nrpages;
 
 		error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
 		if (error)
@@ -2752,7 +2751,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 		BUG_ON(next->vm_start < vms->start);
 		BUG_ON(next->vm_start > vms->end);
 #endif
-	} for_each_vma_range(*(vms->vmi), next, vms->end);
+	}
+
+	vms->next = vma_next(vms->vmi);
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
-- 
2.43.0




* [PATCH v2 07/15] mm/mmap: Extract validate_mm() from vma_complete()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (5 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 06/15] mm/mmap: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 08/15] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

vma_complete() will need to be called at a time when it is unsafe to
call validate_mm().  Extract the validate_mm() call to all call sites
now so that only one location needs to be modified in the next change.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 2a1a49f98fa3..8d9be791997a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -646,7 +646,6 @@ static inline void vma_complete(struct vma_prepare *vp,
 	}
 	if (vp->insert && vp->file)
 		uprobe_mmap(vp->insert);
-	validate_mm(mm);
 }
 
 /*
@@ -734,6 +733,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_store(vmi, vma);
 
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 	return 0;
 
 nomem:
@@ -775,6 +775,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_clear(vmi);
 	vma_set_range(vma, start, end, pgoff);
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 	return 0;
 }
 
@@ -1103,6 +1104,7 @@ static struct vm_area_struct
 	}
 
 	vma_complete(&vp, vmi, mm);
+	validate_mm(mm);
 	khugepaged_enter_vma(res, vm_flags);
 	return res;
 
@@ -2481,6 +2483,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	/* vma_complete stores the new vma */
 	vma_complete(&vp, vmi, vma->vm_mm);
+	validate_mm(vma->vm_mm);
 
 	/* Success. */
 	if (new_below)
@@ -3353,6 +3356,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		vma_iter_store(vmi, vma);
 
 		vma_complete(&vp, vmi, mm);
+		validate_mm(mm);
 		khugepaged_enter_vma(vma, flags);
 		goto out;
 	}
-- 
2.43.0




* [PATCH v2 08/15] mm/mmap: Inline munmap operation in mmap_region()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (6 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 07/15] mm/mmap: Extract validate_mm() from vma_complete() Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 09/15] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

mmap_region() is already passed a sanitized addr and len, so change the
call to do_vmi_munmap() to do_vmi_align_munmap() and inline the other
checks.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8d9be791997a..e9858ca8bbd4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2937,12 +2937,20 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			return -ENOMEM;
 	}
 
-	/* Unmap any existing mapping in the area */
-	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
-	if (error == -EPERM)
-		return error;
-	else if (error)
-		return -ENOMEM;
+
+	if (unlikely(!can_modify_mm(mm, addr, end)))
+		return -EPERM;
+
+	 /* arch_unmap() might do unmaps itself.  */
+	arch_unmap(mm, addr, end);
+
+	/* Find the first overlapping VMA */
+	vma = vma_find(&vmi, end);
+	if (vma) {
+		if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+			return -ENOMEM;
+		vma = NULL;
+	}
 
 	/*
 	 * Private writable mapping: check memory availability
-- 
2.43.0




* [PATCH v2 09/15] mm/mmap: Expand mmap_region() munmap call
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (7 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 08/15] mm/mmap: Inline munmap operation in mmap_region() Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 10/15] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Open code the do_vmi_align_munmap() call so that it can be broken up
later in the series.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index e9858ca8bbd4..f5b33de4e717 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2915,6 +2915,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct vm_area_struct *next, *prev, *merge;
 	pgoff_t pglen = len >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	struct vma_munmap_struct vms;
+	struct ma_state mas_detach;
+	struct maple_tree mt_detach;
 	unsigned long end = addr + len;
 	unsigned long merge_start = addr, merge_end = end;
 	bool writable_file_mapping = false;
@@ -2947,9 +2950,24 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/* Find the first overlapping VMA */
 	vma = vma_find(&vmi, end);
 	if (vma) {
-		if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+		mt_on_stack(mt_detach);
+		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+		init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
+		if (vms_gather_munmap_vmas(&vms, &mas_detach))
+			return -ENOMEM;
+
+		if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
 			return -ENOMEM;
+
+		vms_complete_munmap_vmas(&vms, &mas_detach);
+		next = vms.next;
+		prev = vms.prev;
+		vma_prev(&vmi);
 		vma = NULL;
+	} else {
+		next = vma_next(&vmi);
+		prev = vma_prev(&vmi);
 	}
 
 	/*
@@ -2962,8 +2980,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
-	next = vma_next(&vmi);
-	prev = vma_prev(&vmi);
 	if (vm_flags & VM_SPECIAL) {
 		if (prev)
 			vma_iter_next_range(&vmi);
-- 
2.43.0




* [PATCH v2 10/15] mm/mmap: Reposition vma iterator in mmap_region()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (8 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 09/15] mm/mmap: Expand mmap_region() munmap call Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 11/15] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of moving (or leaving) the vma iterator pointing at the previous
vma, leave it pointing at the insert location.  Pointing the vma
iterator at the insert location allows for a cleaner walk of the vma
tree for the MAP_FIXED and no-expansion cases.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f5b33de4e717..ecf55d32e804 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2963,11 +2963,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vms_complete_munmap_vmas(&vms, &mas_detach);
 		next = vms.next;
 		prev = vms.prev;
-		vma_prev(&vmi);
 		vma = NULL;
 	} else {
 		next = vma_next(&vmi);
 		prev = vma_prev(&vmi);
+		if (prev)
+			vma_iter_next_range(&vmi);
 	}
 
 	/*
@@ -2980,11 +2981,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
-	if (vm_flags & VM_SPECIAL) {
-		if (prev)
-			vma_iter_next_range(&vmi);
+	if (vm_flags & VM_SPECIAL)
 		goto cannot_expand;
-	}
 
 	/* Attempt to expand an old mapping */
 	/* Check next */
@@ -3005,19 +3003,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		merge_start = prev->vm_start;
 		vma = prev;
 		vm_pgoff = prev->vm_pgoff;
-	} else if (prev) {
-		vma_iter_next_range(&vmi);
+		vma_prev(&vmi);
 	}
 
-	/* Actually expand, if possible */
-	if (vma &&
-	    !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
-		khugepaged_enter_vma(vma, vm_flags);
-		goto expanded;
+	if (vma) {
+		/* Actually expand, if possible */
+		if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
+			khugepaged_enter_vma(vma, vm_flags);
+			goto expanded;
+		}
+
+		/* If the expand fails, then reposition the vma iterator */
+		if (unlikely(vma == prev))
+			vma_iter_set(&vmi, addr);
 	}
 
-	if (vma == prev)
-		vma_iter_set(&vmi, addr);
 cannot_expand:
 
 	/*
-- 
2.43.0




* [PATCH v2 11/15] mm/mmap: Track start and end of munmap in vma_munmap_struct
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (9 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 10/15] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 12/15] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Set the start and end addresses for the munmap when prev and next are
gathered.  This is needed to avoid incorrect addresses being used during
the vms_complete_munmap_vmas() function if the prev/next vmas are
expanded.
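
In effect (a simplified restatement of the hunks below), the unmap
window is captured while the vmas are gathered:

	/* Defaults set in init_vma_munmap(): */
	vms->unmap_start = FIRST_USER_ADDRESS;
	vms->unmap_end = USER_PGTABLES_CEILING;
	/* Narrowed once prev/next are known in vms_gather_munmap_vmas(): */
	if (vms->prev)
		vms->unmap_start = vms->prev->vm_end;
	if (vms->next)
		vms->unmap_end = vms->next->vm_start;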

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/internal.h |  2 ++
 mm/mmap.c     | 13 +++++++++----
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 90cab15c3b81..b0300cb22353 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1492,6 +1492,8 @@ struct vma_munmap_struct {
 	struct list_head *uf;		/* Userfaultfd list_head */
 	unsigned long start;		/* Aligned start addr */
 	unsigned long end;		/* Aligned end addr */
+	unsigned long unmap_start;	/* Unmap PTE start */
+	unsigned long unmap_end;	/* Unmap PTE end */
 	int vma_count;			/* Number of vmas that will be removed */
 	unsigned long nr_pages;		/* Number of pages being removed */
 	unsigned long locked_vm;	/* Number of locked pages */
diff --git a/mm/mmap.c b/mm/mmap.c
index ecf55d32e804..5efcba084e12 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -525,6 +525,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->vma_count = 0;
 	vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
 	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+	vms->unmap_start = FIRST_USER_ADDRESS;
+	vms->unmap_end = USER_PGTABLES_CEILING;
 }
 
 /*
@@ -2409,9 +2411,7 @@ static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
 	mas_set(mas, mt_start);
-	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
-				 next ? next->vm_start : USER_PGTABLES_CEILING,
-				 mm_wr_locked);
+	free_pgtables(&tlb, mas, vma, start, end, mm_wr_locked);
 	tlb_finish_mmu(&tlb);
 }
 
@@ -2637,7 +2637,8 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	 */
 	mas_set(mas_detach, 1);
 	unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
-		     vms->start, vms->end, vms->vma_count, !vms->unlock);
+		     vms->unmap_start, vms->unmap_end, vms->vma_count,
+		     !vms->unlock);
 	/* Update high watermark before we lower total_vm */
 	update_hiwater_vm(mm);
 	/* Stat accounting */
@@ -2699,6 +2700,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 			goto start_split_failed;
 	}
 	vms->prev = vma_prev(vms->vmi);
+	if (vms->prev)
+		vms->unmap_start = vms->prev->vm_end;
 
 	/*
 	 * Detach a range of VMAs from the mm. Using next as a temp variable as
@@ -2757,6 +2760,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
 	}
 
 	vms->next = vma_next(vms->vmi);
+	if (vms->next)
+		vms->unmap_end = vms->next->vm_start;
 
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 	/* Make sure no VMAs are about to be lost. */
-- 
2.43.0




* [PATCH v2 12/15] mm/mmap: Avoid zeroing vma tree in mmap_region()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (10 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 11/15] mm/mmap: Track start and end of munmap in vma_munmap_struct Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 13/15] mm/mmap: Use PHYS_PFN " Liam R. Howlett
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of zeroing the vma tree and then overwriting the area, let the
area be overwritten and then clean up the gathered vmas using
vms_complete_munmap_vmas().

Temporarily keep track of the number of pages that will be removed and
reduce the charged amount.

This also drops the validate_mm() call in the vma_expand() function; the
check would fail there because the mm map_count is incorrect during a
vma expansion, prior to the cleanup done by vms_complete_munmap_vmas().
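
A simplified restatement of the new mmap_region() ordering for the
MAP_FIXED case (the details are in the hunks below):

	vms_gather_munmap_vmas(&vms, &mas_detach); /* old vmas -> side tree */
	/* ... the new vma is written over the range; no zeroing step ... */
	if (file && vms.nr_pages) {
		/* A driver's mmap may install PTEs, so clear the old ones */
		unmap_region(...);
		vms.cleared_ptes = true;
	}
	vms_complete_munmap_vmas(&vms, &mas_detach); /* after the overwrite */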

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/internal.h |  1 +
 mm/mmap.c     | 65 +++++++++++++++++++++++++++++++--------------------
 2 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b0300cb22353..2ad6310059eb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1502,6 +1502,7 @@ struct vma_munmap_struct {
 	unsigned long stack_vm;
 	unsigned long data_vm;
 	bool unlock;			/* Unlock after the munmap */
+	bool cleared_ptes;		/* If the PTEs are already cleared */
 };
 
 void __meminit __init_single_page(struct page *page, unsigned long pfn,
diff --git a/mm/mmap.c b/mm/mmap.c
index 5efcba084e12..b7f47964aaf0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -406,17 +406,21 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
 }
 
 static unsigned long count_vma_pages_range(struct mm_struct *mm,
-		unsigned long addr, unsigned long end)
+		unsigned long addr, unsigned long end,
+		unsigned long *nr_accounted)
 {
 	VMA_ITERATOR(vmi, mm, addr);
 	struct vm_area_struct *vma;
 	unsigned long nr_pages = 0;
 
+	*nr_accounted = 0;
 	for_each_vma_range(vmi, vma, end) {
 		unsigned long vm_start = max(addr, vma->vm_start);
 		unsigned long vm_end = min(end, vma->vm_end);
 
 		nr_pages += PHYS_PFN(vm_end - vm_start);
+		if (vma->vm_flags & VM_ACCOUNT)
+			*nr_accounted += PHYS_PFN(vm_end - vm_start);
 	}
 
 	return nr_pages;
@@ -527,6 +531,7 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
 	vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
 	vms->unmap_start = FIRST_USER_ADDRESS;
 	vms->unmap_end = USER_PGTABLES_CEILING;
+	vms->cleared_ptes = false;
 }
 
 /*
@@ -735,7 +740,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma_iter_store(vmi, vma);
 
 	vma_complete(&vp, vmi, vma->vm_mm);
-	validate_mm(vma->vm_mm);
 	return 0;
 
 nomem:
@@ -2631,6 +2635,8 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	if (vms->unlock)
 		mmap_write_downgrade(mm);
 
+	if (vms->cleared_ptes)
+		goto cleared_ptes;
 	/*
 	 * We can free page tables without write-locking mmap_lock because VMAs
 	 * were isolated before we downgraded mmap_lock.
@@ -2639,6 +2645,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
 	unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
 		     vms->unmap_start, vms->unmap_end, vms->vma_count,
 		     !vms->unlock);
+cleared_ptes:
 	/* Update high watermark before we lower total_vm */
 	update_hiwater_vm(mm);
 	/* Stat accounting */
@@ -2927,24 +2934,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long merge_start = addr, merge_end = end;
 	bool writable_file_mapping = false;
 	pgoff_t vm_pgoff;
-	int error;
+	int error = -ENOMEM;
 	VMA_ITERATOR(vmi, mm, addr);
+	unsigned long nr_pages, nr_accounted;
 
-	/* Check against address space limit. */
-	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
-		unsigned long nr_pages;
-
-		/*
-		 * MAP_FIXED may remove pages of mappings that intersects with
-		 * requested mapping. Account for the pages it would unmap.
-		 */
-		nr_pages = count_vma_pages_range(mm, addr, end);
-
-		if (!may_expand_vm(mm, vm_flags,
-					(len >> PAGE_SHIFT) - nr_pages))
-			return -ENOMEM;
-	}
+	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
 
+	/* Check against address space limit. */
+	/*
+	 * MAP_FIXED may remove pages of mappings that intersect with the
+	 * requested mapping.  Account for the pages it would unmap.
+	 */
+	if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+		return -ENOMEM;
 
 	if (unlikely(!can_modify_mm(mm, addr, end)))
 		return -EPERM;
@@ -2962,14 +2964,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		if (vms_gather_munmap_vmas(&vms, &mas_detach))
 			return -ENOMEM;
 
-		if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
-			return -ENOMEM;
-
-		vms_complete_munmap_vmas(&vms, &mas_detach);
 		next = vms.next;
 		prev = vms.prev;
 		vma = NULL;
 	} else {
+		/* Minimal setup of vms */
+		vms.nr_pages = 0;
 		next = vma_next(&vmi);
 		prev = vma_prev(&vmi);
 		if (prev)
@@ -2981,8 +2981,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 */
 	if (accountable_mapping(file, vm_flags)) {
 		charged = len >> PAGE_SHIFT;
+		charged -= nr_accounted;
 		if (security_vm_enough_memory_mm(mm, charged))
-			return -ENOMEM;
+			goto abort_munmap;
+		vms.nr_accounted = 0;
 		vm_flags |= VM_ACCOUNT;
 	}
 
@@ -3031,10 +3033,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * not unmapped, but the maps are removed from the list.
 	 */
 	vma = vm_area_alloc(mm);
-	if (!vma) {
-		error = -ENOMEM;
+	if (!vma)
 		goto unacct_error;
-	}
 
 	vma_iter_config(&vmi, addr, end);
 	vma_set_range(vma, addr, end, pgoff);
@@ -3043,6 +3043,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	if (file) {
 		vma->vm_file = get_file(file);
+		/* This may map PTEs, so ensure there are no existing PTEs */
+		if (vms.nr_pages) {
+			mas_set(&mas_detach, 1);
+			unmap_region(mm, &mas_detach, vms.vma, prev, next,
+				     vms.unmap_start, vms.unmap_end,
+				     vms.vma_count, /*mm_wr_locked = */ true);
+			vms.cleared_ptes = true;
+		}
 		error = call_mmap(file, vma);
 		if (error)
 			goto unmap_and_free_vma;
@@ -3133,6 +3141,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 expanded:
 	perf_event_mmap(vma);
 
+	if (vms.nr_pages)
+		vms_complete_munmap_vmas(&vms, &mas_detach);
+
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
@@ -3181,6 +3192,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 unacct_error:
 	if (charged)
 		vm_unacct_memory(charged);
+
+abort_munmap:
+	if (vms.nr_pages)
+		abort_munmap_vmas(&mas_detach);
 	validate_mm(mm);
 	return error;
 }
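
To illustrate the cleared_ptes handshake above, here is a minimal
userspace sketch of the control flow (the struct and the printouts are
invented for the example; only the flag logic mirrors the patch):

#include <stdio.h>
#include <stdbool.h>

/* Toy model: if mmap_region() already had to tear down the old PTEs
 * before call_mmap() (because a driver may install mappings), then the
 * completion step must not unmap the region a second time.
 */
struct vms_model {
	bool cleared_ptes;
};

static void complete_munmap(struct vms_model *vms)
{
	if (!vms->cleared_ptes)
		printf("completion: unmap_region()\n");
	else
		printf("completion: PTEs already cleared, skipping unmap\n");
	printf("completion: update stats, free detached vmas\n");
}

int main(void)
{
	struct vms_model anon_path = { .cleared_ptes = false };
	struct vms_model file_path = { .cleared_ptes = true };

	complete_munmap(&anon_path);	/* plain MAP_FIXED replacement */
	complete_munmap(&file_path);	/* file-backed MAP_FIXED path */
	return 0;
}
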
-- 
2.43.0



* [PATCH v2 13/15] mm/mmap: Use PHYS_PFN in mmap_region()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (11 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 12/15] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-25 19:11 ` [PATCH v2 14/15] mm/mmap: Use vms accounted pages " Liam R. Howlett
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Instead of shifting the length by PAGE_SHIFT, use PHYS_PFN().  Also use
the existing local variable everywhere instead of only in some places.
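
For reference, a minimal userspace check of the equivalence (this
assumes the usual definition of PHYS_PFN() from include/linux/pfn.h and
a 4K page size; both are assumptions of the sketch, not part of the
patch):

#include <stdio.h>

#define PAGE_SHIFT 12				/* assumed: 4K pages */
#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

int main(void)
{
	unsigned long len = 5 * 4096 + 123;	/* an unaligned length */

	printf("len >> PAGE_SHIFT = %lu\n", len >> PAGE_SHIFT);	/* 5 */
	printf("PHYS_PFN(len)     = %lu\n", PHYS_PFN(len));	/* 5 */
	return 0;
}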

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index b7f47964aaf0..f3edabf83975 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2925,7 +2925,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	struct vm_area_struct *next, *prev, *merge;
-	pgoff_t pglen = len >> PAGE_SHIFT;
+	pgoff_t pglen = PHYS_PFN(len);
 	unsigned long charged = 0;
 	struct vma_munmap_struct vms;
 	struct ma_state mas_detach;
@@ -2945,7 +2945,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * MAP_FIXED may remove pages of mappings that intersect with the
 	 * requested mapping.  Account for the pages it would unmap.
 	 */
-	if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
 		return -ENOMEM;
 
 	if (unlikely(!can_modify_mm(mm, addr, end)))
@@ -2980,7 +2980,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * Private writable mapping: check memory availability
 	 */
 	if (accountable_mapping(file, vm_flags)) {
-		charged = len >> PAGE_SHIFT;
+		charged = pglen;
 		charged -= nr_accounted;
 		if (security_vm_enough_memory_mm(mm, charged))
 			goto abort_munmap;
@@ -3144,14 +3144,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	if (vms.nr_pages)
 		vms_complete_munmap_vmas(&vms, &mas_detach);
 
-	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
+	vm_stat_account(mm, vm_flags, pglen);
 	if (vm_flags & VM_LOCKED) {
 		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
 					is_vm_hugetlb_page(vma) ||
 					vma == get_gate_vma(current->mm))
 			vm_flags_clear(vma, VM_LOCKED_MASK);
 		else
-			mm->locked_vm += (len >> PAGE_SHIFT);
+			mm->locked_vm += pglen;
 	}
 
 	if (file)
-- 
2.43.0



* [PATCH v2 14/15] mm/mmap: Use vms accounted pages in mmap_region()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (12 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 13/15] mm/mmap: Use PHYS_PFN " Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-26 16:32   ` Kees Cook
  2024-06-25 19:11 ` [PATCH v2 15/15] mm/mmap: Move may_expand_vm() check " Liam R. Howlett
  2024-06-26 20:58 ` [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Andrew Morton
  15 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Change from nr_pages variable to vms.nr_accounted for the charged pages
calculation.  This is necessary for a future patch.

This also avoids checking security_vm_enough_memory_mm() if the amount
of memory won't change.
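
The skip relies on C's short-circuit evaluation; a stub sketch (the
function below is a stand-in for the real security_vm_enough_memory_mm(),
not the kernel code):

#include <stdio.h>
#include <stdbool.h>

static bool security_vm_enough_memory_mm_stub(long pages)
{
	printf("accounting call made for %ld pages\n", pages);
	return false;			/* pretend: enough memory */
}

int main(void)
{
	long charged = 0;	/* full MAP_FIXED replacement: no net change */

	/* mirrors: if (charged && security_vm_enough_memory_mm(mm, charged)) */
	if (charged && security_vm_enough_memory_mm_stub(charged))
		return 1;		/* would abort the munmap */

	printf("charged == 0: accounting call skipped\n");
	return 0;
}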

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Kees Cook <kees@kernel.org>
---
 mm/mmap.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index f3edabf83975..adb0bb5ea344 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2970,6 +2970,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	} else {
 		/* Minimal setup of vms */
 		vms.nr_pages = 0;
+		vms.nr_accounted = 0;
 		next = vma_next(&vmi);
 		prev = vma_prev(&vmi);
 		if (prev)
@@ -2981,9 +2982,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 */
 	if (accountable_mapping(file, vm_flags)) {
 		charged = pglen;
-		charged -= nr_accounted;
-		if (security_vm_enough_memory_mm(mm, charged))
+		charged -= vms.nr_accounted;
+		if (charged && security_vm_enough_memory_mm(mm, charged))
 			goto abort_munmap;
+
 		vms.nr_accounted = 0;
 		vm_flags |= VM_ACCOUNT;
 	}
-- 
2.43.0



* [PATCH v2 15/15] mm/mmap: Move may_expand_vm() check in mmap_region()
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (13 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 14/15] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-06-25 19:11 ` Liam R. Howlett
  2024-06-26 20:58 ` [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Andrew Morton
  15 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-25 19:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook,
	Liam R. Howlett

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
call, so use it instead of looping over the vmas twice.
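
As the diff below shows, the removed count_vma_pages_range() clamps each
vma to the request before counting; the arithmetic it performed (and the
gather step now supplies as vms.nr_pages) is only this (addresses
invented, 4K pages assumed for the sketch):

#include <stdio.h>

#define PAGE_SHIFT 12	/* assumed: 4K pages */
#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

int main(void)
{
	unsigned long addr = 0x1000, end = 0x6000;		/* request */
	unsigned long vm_start = 0x3000, vm_end = 0x8000;	/* a vma */

	unsigned long lo = addr > vm_start ? addr : vm_start;	/* max() */
	unsigned long hi = end < vm_end ? end : vm_end;		/* min() */

	printf("overlap pages: %lu\n", PHYS_PFN(hi - lo));	/* 3 */
	return 0;
}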

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 36 ++++--------------------------------
 1 file changed, 4 insertions(+), 32 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index adb0bb5ea344..a310b05a01c2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -405,27 +405,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
 		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
 }
 
-static unsigned long count_vma_pages_range(struct mm_struct *mm,
-		unsigned long addr, unsigned long end,
-		unsigned long *nr_accounted)
-{
-	VMA_ITERATOR(vmi, mm, addr);
-	struct vm_area_struct *vma;
-	unsigned long nr_pages = 0;
-
-	*nr_accounted = 0;
-	for_each_vma_range(vmi, vma, end) {
-		unsigned long vm_start = max(addr, vma->vm_start);
-		unsigned long vm_end = min(end, vma->vm_end);
-
-		nr_pages += PHYS_PFN(vm_end - vm_start);
-		if (vma->vm_flags & VM_ACCOUNT)
-			*nr_accounted += PHYS_PFN(vm_end - vm_start);
-	}
-
-	return nr_pages;
-}
-
 static void __vma_link_file(struct vm_area_struct *vma,
 			    struct address_space *mapping)
 {
@@ -2936,17 +2915,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	pgoff_t vm_pgoff;
 	int error = -ENOMEM;
 	VMA_ITERATOR(vmi, mm, addr);
-	unsigned long nr_pages, nr_accounted;
-
-	nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
-
-	/* Check against address space limit. */
-	/*
-	 * MAP_FIXED may remove pages of mappings that intersect with the
-	 * requested mapping.  Account for the pages it would unmap.
-	 */
-	if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
-		return -ENOMEM;
 
 	if (unlikely(!can_modify_mm(mm, addr, end)))
 		return -EPERM;
@@ -2977,6 +2945,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			vma_iter_next_range(&vmi);
 	}
 
+	/* Check against address space limit. */
+	if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
+		goto abort_munmap;
+
 	/*
 	 * Private writable mapping: check memory availability
 	 */
-- 
2.43.0



* Re: [PATCH v2 14/15] mm/mmap: Use vms accounted pages in mmap_region()
  2024-06-25 19:11 ` [PATCH v2 14/15] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-06-26 16:32   ` Kees Cook
  2024-06-26 18:04     ` Liam R. Howlett
  0 siblings, 1 reply; 31+ messages in thread
From: Kees Cook @ 2024-06-26 16:32 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	linux-security-module

On Tue, Jun 25, 2024 at 03:11:44PM -0400, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> 
> Change from nr_pages variable to vms.nr_accounted for the charged pages
> calculation.  This is necessary for a future patch.
> 
> This also avoids checking security_vm_enough_memory_mm() if the amount
> of memory won't change.

Is there a reason for making this change? (I.e. why not leave off the
"charged" test?)

Looking at the callbacks in the LSM, only capabilities and SELinux are
hooking this, and both are checking whether a process has elevated privs
and are ignoring the "pages" argument entirely, so I'm not sure it's
safe to change the logic for whether to make the call based on an unused
argument (i.e. the LSM may want to _always_ know about this). On the
other hand, it looks like it's purely an accounting issue, and if the
page count didn't change, there's no reason to bother calling into all
this to make no changes to the accounting.
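
For reference, the dispatch being discussed has roughly this shape (a
from-memory userspace model, not the actual security/ code; the names
only echo the kernel's):

#include <stdio.h>

static int hook_vote(long pages)
{
	(void)pages;		/* the in-tree hooks ignore "pages" */
	return 1;		/* pretend: privileged task */
}

static int vm_enough_memory_model(long pages, int cap_sys_admin)
{
	/* the only consumer of "pages": the accounting decision */
	printf("account %ld pages, cap_sys_admin=%d\n", pages, cap_sys_admin);
	return 0;		/* pretend: allowed */
}

static int security_vm_enough_memory_model(long pages)
{
	return vm_enough_memory_model(pages, hook_vote(pages));
}

int main(void)
{
	return security_vm_enough_memory_model(512);
}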

I've added the LSM list to CC...

-Kees

> 
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Cc: Kees Cook <kees@kernel.org>
> ---
>  mm/mmap.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f3edabf83975..adb0bb5ea344 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2970,6 +2970,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	} else {
>  		/* Minimal setup of vms */
>  		vms.nr_pages = 0;
> +		vms.nr_accounted = 0;
>  		next = vma_next(&vmi);
>  		prev = vma_prev(&vmi);
>  		if (prev)
> @@ -2981,9 +2982,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  	 */
>  	if (accountable_mapping(file, vm_flags)) {
>  		charged = pglen;
> -		charged -= nr_accounted;
> -		if (security_vm_enough_memory_mm(mm, charged))
> +		charged -= vms.nr_accounted;
> +		if (charged && security_vm_enough_memory_mm(mm, charged))
>  			goto abort_munmap;
> +
>  		vms.nr_accounted = 0;
>  		vm_flags |= VM_ACCOUNT;
>  	}
> -- 
> 2.43.0
> 

-- 
Kees Cook


* Re: [PATCH v2 14/15] mm/mmap: Use vms accounted pages in mmap_region()
  2024-06-26 16:32   ` Kees Cook
@ 2024-06-26 18:04     ` Liam R. Howlett
  2024-06-26 18:45       ` Kees Cook
  0 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-26 18:04 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-mm, Andrew Morton, Suren Baghdasaryan, Vlastimil Babka,
	Lorenzo Stoakes, Matthew Wilcox, sidhartha.kumar,
	Paul E . McKenney, Bert Karwatzki, Jiri Olsa, linux-kernel,
	linux-security-module

* Kees Cook <kees@kernel.org> [240626 12:32]:
> On Tue, Jun 25, 2024 at 03:11:44PM -0400, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > 
> > Change from nr_pages variable to vms.nr_accounted for the charged pages
> > calculation.  This is necessary for a future patch.
> > 
> > This also avoids checking security_vm_enough_memory_mm() if the amount
> > of memory won't change.
> 
> Is there a reason for making this change? (I.e. why not leave off the
> "charged" test?)

Before this series, the munmap() completed prior to mmap()'ing the
MAP_FIXED vma.  If we don't subtract nr_accounted from charged, we risk
spuriously hitting the maximum limit.
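
A worked example with invented numbers, since the charge now happens
while the to-be-replaced VM_ACCOUNT pages are still accounted:

#include <stdio.h>

int main(void)
{
	long limit = 1000;		/* hypothetical commit limit (pages) */
	long committed = 900;		/* includes the old 512-page mapping */
	long pglen = 512;		/* MAP_FIXED request size */
	long nr_accounted = 512;	/* accounted pages being replaced */

	long naive = committed + pglen;				/* 1412 */
	long fixed = committed + (pglen - nr_accounted);	/* 900 */

	printf("without subtracting: %ld vs %ld -> %s\n",
	       naive, limit, naive > limit ? "spurious failure" : "ok");
	printf("charged -= nr_accounted: %ld vs %ld -> %s\n",
	       fixed, limit, fixed > limit ? "spurious failure" : "ok");
	return 0;
}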

> 
> Looking at the callbacks in the LSM, only capabilities and SELinux are
> hooking this, and both are checking whether a process has elevated privs
> and are ignoring the "pages" argument entirely, so I'm not sure it's
> safe to change the logic for whether to make the call based on an unused
> argument (i.e. the LSM may want to _always_ know about this). On the
> other hand, it looks like it's purely an accounting issue, and if the
> page count didn't change, there's no reason to bother calling into all
> this to make no changes to the accounting.

I didn't see any reason we couldn't skip the call, but your point is
valid.  Looking at the callbacks, I didn't see anything that would have
an issue with skipping it - but I'd like to hear what the LSM folks say.

I don't have any objections to removing the extra check, if anyone
thinks it could be an issue.

> 
> I've added the LSM list to CC...

Thank you, and thanks for looking at this.

> 
> -Kees
> 
> > 
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Cc: Kees Cook <kees@kernel.org>
> > ---
> >  mm/mmap.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f3edabf83975..adb0bb5ea344 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2970,6 +2970,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  	} else {
> >  		/* Minimal setup of vms */
> >  		vms.nr_pages = 0;
> > +		vms.nr_accounted = 0;
> >  		next = vma_next(&vmi);
> >  		prev = vma_prev(&vmi);
> >  		if (prev)
> > @@ -2981,9 +2982,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  	 */
> >  	if (accountable_mapping(file, vm_flags)) {
> >  		charged = pglen;
> > -		charged -= nr_accounted;
> > -		if (security_vm_enough_memory_mm(mm, charged))
> > +		charged -= vms.nr_accounted;
> > +		if (charged && security_vm_enough_memory_mm(mm, charged))
> >  			goto abort_munmap;
> > +
> >  		vms.nr_accounted = 0;
> >  		vm_flags |= VM_ACCOUNT;
> >  	}
> > -- 
> > 2.43.0
> > 
> 
> -- 
> Kees Cook


* Re: [PATCH v2 14/15] mm/mmap: Use vms accounted pages in mmap_region()
  2024-06-26 18:04     ` Liam R. Howlett
@ 2024-06-26 18:45       ` Kees Cook
  0 siblings, 0 replies; 31+ messages in thread
From: Kees Cook @ 2024-06-26 18:45 UTC (permalink / raw)
  To: Liam R. Howlett, linux-mm, Andrew Morton, Suren Baghdasaryan,
	Vlastimil Babka, Lorenzo Stoakes, Matthew Wilcox,
	sidhartha.kumar, Paul E . McKenney, Bert Karwatzki, Jiri Olsa,
	linux-kernel, linux-security-module

On Wed, Jun 26, 2024 at 02:04:53PM -0400, Liam R. Howlett wrote:
> * Kees Cook <kees@kernel.org> [240626 12:32]:
> > On Tue, Jun 25, 2024 at 03:11:44PM -0400, Liam R. Howlett wrote:
> > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > 
> > > Change from nr_pages variable to vms.nr_accounted for the charged pages
> > > calculation.  This is necessary for a future patch.
> > > 
> > > This also avoids checking security_vm_enough_memory_mm() if the amount
> > > of memory won't change.
> > 
> > Is there a reason for making this change? (I.e. why not leave off the
> > "charged" test?)
> 
> Before this series, the munmap() completed prior to mmap()'ing the
> MAP_FIXED vma.  If we don't subtract nr_accounted from charged, we risk
> spuriously hitting the maximum limit.
> 
> > 
> > Looking at the callbacks in the LSM, only capabilities and SELinux are
> > hooking this, and both are checking whether a process has elevated privs
> > and are ignoring the "pages" argument entirely, so I'm not sure it's
> > safe to change the logic for whether to make the call based on an unused
> > argument (i.e. the LSM may want to _always_ know about this). On the
> > other hand, it looks like it's purely an accounting issue, and if the
> > page count didn't change, there's no reason to bother calling into all
> > this to make no changes to the accounting.
> 
> I didn't see any reason we couldn't skip the call, but your point is
> valid.  Looking at the callbacks, I didn't see anything that would have
> an issue with skipping it - but I'd like to hear what the LSM folks say.
> 
> I don't have any objections to removing the extra check, if anyone
> thinks it could be an issue.
> 
> > 
> > I've added the LSM list to CC...
> 
> Thank you, and thanks for looking at this.

Sure! And barring any other feedback, I think this is safe, given the
changes to the accounting logic: no change means nothing to check.

Reviewed-by: Kees Cook <kees@kernel.org>

-Kees

> 
> > 
> > -Kees
> > 
> > > 
> > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > Cc: Kees Cook <kees@kernel.org>
> > > ---
> > >  mm/mmap.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index f3edabf83975..adb0bb5ea344 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2970,6 +2970,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	} else {
> > >  		/* Minimal setup of vms */
> > >  		vms.nr_pages = 0;
> > > +		vms.nr_accounted = 0;
> > >  		next = vma_next(&vmi);
> > >  		prev = vma_prev(&vmi);
> > >  		if (prev)
> > > @@ -2981,9 +2982,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > >  	 */
> > >  	if (accountable_mapping(file, vm_flags)) {
> > >  		charged = pglen;
> > > -		charged -= nr_accounted;
> > > -		if (security_vm_enough_memory_mm(mm, charged))
> > > +		charged -= vms.nr_accounted;
> > > +		if (charged && security_vm_enough_memory_mm(mm, charged))
> > >  			goto abort_munmap;
> > > +
> > >  		vms.nr_accounted = 0;
> > >  		vm_flags |= VM_ACCOUNT;
> > >  	}
> > > -- 
> > > 2.43.0
> > > 
> > 
> > -- 
> > Kees Cook

-- 
Kees Cook


* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
  2024-06-25 19:11 [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Liam R. Howlett
                   ` (14 preceding siblings ...)
  2024-06-25 19:11 ` [PATCH v2 15/15] mm/mmap: Move may_expand_vm() check " Liam R. Howlett
@ 2024-06-26 20:58 ` Andrew Morton
  2024-06-27  1:15   ` Liam R. Howlett
  15 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2024-06-26 20:58 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook

On Tue, 25 Jun 2024 15:11:30 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> It is now possible to walk the vma tree using the rcu read locks and is
> beneficial to do so to reduce lock contention.  Doing so while a
> MAP_FIXED mapping is executing means that a reader may see a gap in the
> vma tree that should never logically exist - and does not when using the
> mmap lock in read mode.  The temporal gap exists because mmap_region()
> calls munmap() prior to installing the new mapping.

What are the consequences when this race hits?  IOW, why do we need to
change anything?


* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
  2024-06-26 20:58 ` [PATCH v2 00/15] Avoid MAP_FIXED gap exposure Andrew Morton
@ 2024-06-27  1:15   ` Liam R. Howlett
  2024-06-27  1:28     ` Andrew Morton
  0 siblings, 1 reply; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-27  1:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook

* Andrew Morton <akpm@linux-foundation.org> [240626 16:59]:
> On Tue, 25 Jun 2024 15:11:30 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> 
> > It is now possible to walk the vma tree using the rcu read locks and is
> > beneficial to do so to reduce lock contention.  Doing so while a
> > MAP_FIXED mapping is executing means that a reader may see a gap in the
> > vma tree that should never logically exist - and does not when using the
> > mmap lock in read mode.  The temporal gap exists because mmap_region()
> > calls munmap() prior to installing the new mapping.
> 
> What are the consequences when this race hits?  IOW, why do we need to
> change anything?
> 

In the (near) future, we want to walk the vma tree to produce
/proc/<pid>/maps.  Without this change we will see the temporal gap and
expose it to the user.  This series was initially sent to Suren as part
of his patch set.

We also have the new interface for an ioctl request to look up the vma
at or above an address.  I had highlighted that an rcu reader would be
ideal there, but it proved too difficult at this time.  These patches by
Andrii are currently not using the rcu reading method because this
series and a per-vma locking clarification are needed first.

Since there were two users for this code, I decided to send it out
before the other patches.

Thanks,
Liam


* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
  2024-06-27  1:15   ` Liam R. Howlett
@ 2024-06-27  1:28     ` Andrew Morton
  2024-06-27 13:31       ` Liam R. Howlett
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2024-06-27  1:28 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: linux-mm, Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook

On Wed, 26 Jun 2024 21:15:18 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:

> * Andrew Morton <akpm@linux-foundation.org> [240626 16:59]:
> > On Tue, 25 Jun 2024 15:11:30 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> > 
> > > It is now possible to walk the vma tree using the rcu read locks and is
> > > beneficial to do so to reduce lock contention.  Doing so while a
> > > MAP_FIXED mapping is executing means that a reader may see a gap in the
> > > vma tree that should never logically exist - and does not when using the
> > > mmap lock in read mode.  The temporal gap exists because mmap_region()
> > > calls munmap() prior to installing the new mapping.
> > 
> > What are the consequences when this race hits?  IOW, why do we need to
> > change anything?
> > 
> 
> In the (near) future, we want to walk the vma tree to produce
> /proc/<pid>/maps.  Without this change we will see the temporal gap and
> expose it to the user.  This series was initially sent to Suren as part
> of his patch set.
> 
> We also have the new interface for an ioctl request to a vma at or above
> an address. I had highlighted that an rcu reader would be ideal, but
> proved too difficult at this time. These patches by Andrii are currently
> not using the rcu reading method as this and a per-vma locking
> clarification are needed.
> 
> Since there were two users for this code, I decided to send it out
> before the other patches.

OK, thanks.  We're approaching -rc6 and things are a bit sketchy so I'm
inclined to hold this off until the next cycle, unless there's urgency?


* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
  2024-06-27  1:28     ` Andrew Morton
@ 2024-06-27 13:31       ` Liam R. Howlett
  0 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-06-27 13:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Suren Baghdasaryan, Vlastimil Babka, Lorenzo Stoakes,
	Matthew Wilcox, sidhartha.kumar, Paul E . McKenney,
	Bert Karwatzki, Jiri Olsa, linux-kernel, Kees Cook

* Andrew Morton <akpm@linux-foundation.org> [240626 21:28]:
> On Wed, 26 Jun 2024 21:15:18 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> 
> > * Andrew Morton <akpm@linux-foundation.org> [240626 16:59]:
> > > On Tue, 25 Jun 2024 15:11:30 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
> > > 
> > > > It is now possible to walk the vma tree using the rcu read locks and is
> > > > beneficial to do so to reduce lock contention.  Doing so while a
> > > > MAP_FIXED mapping is executing means that a reader may see a gap in the
> > > > vma tree that should never logically exist - and does not when using the
> > > > mmap lock in read mode.  The temporal gap exists because mmap_region()
> > > > calls munmap() prior to installing the new mapping.
> > > 
> > > What are the consequences when this race hits?  IOW, why do we need to
> > > change anything?
> > > 
> > 
> > In the (near) future, we want to walk the vma tree to produce
> > /proc/<pid>/maps.  Without this change we will see the temporal gap and
> > expose it to the user.  This series was initially sent to Suren as part
> > of his patch set.
> > 
> > We also have the new interface for an ioctl request to a vma at or above
> > an address. I had highlighted that an rcu reader would be ideal, but
> > proved too difficult at this time. These patches by Andrii are currently
> > not using the rcu reading method as this and a per-vma locking
> > clarification are needed.
> > 
> > Since there were two users for this code, I decided to send it out
> > before the other patches.
> 
> OK, thanks.  We're approaching -rc6 and things are a bit sketchy so I'm
> inclined to hold this off until the next cycle, unless there's urgency?
> 

There is no urgency.  I'm more than happy to hold off merging to get a
full cycle of testing.

Thanks,
Liam



* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
@ 2024-07-04 21:50 Bert Karwatzki
  0 siblings, 0 replies; 31+ messages in thread
From: Bert Karwatzki @ 2024-07-04 21:50 UTC (permalink / raw)
  To: Liam R . Howlett; +Cc: Bert Karwatzki, Andrew Morton, linux-mm

I just tested the v3 patchset on top of linux-next-20240703 with
`stress-ng --vm-segv 16`.  In about 5 minutes of testing no errors occurred.
This seems to be a good sign, especially since testing the v2 patchset
yielded more than a million errors in 30 seconds.

Bert Karwatzki




* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
@ 2024-07-04 21:15 Bert Karwatzki
  0 siblings, 0 replies; 31+ messages in thread
From: Bert Karwatzki @ 2024-07-04 21:15 UTC (permalink / raw)
  To: Liam R . Howlett; +Cc: Bert Karwatzki, Andrew Morton, linux-mm

Also, this time the kernel was compiled with CONFIG_DEBUG_VM_MAPLE_TREE=y,
but /var/log/kern.log contained no related warnings.

Bert Karwatzki




* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
@ 2024-07-04 21:07 Bert Karwatzki
  0 siblings, 0 replies; 31+ messages in thread
From: Bert Karwatzki @ 2024-07-04 21:07 UTC (permalink / raw)
  To: Liam R . Howlett; +Cc: Bert Karwatzki, Andrew Morton, linux-mm

I tested the v2 patch on top of linux-next-20240703 with stress-ng --vm-segv 16 and
got more than a million "Bad rss-counter state" errors plus an invalid opcode error!
I'm rebooting now to test v3.

Bert Karwatzki


[ T1359] BUG: Bad rss-counter state mm:00000000e5421690 type:MM_ANONPAGES val:370
[...] Here are more than 10^6 (~2^20) lines of "Bad rss-counter-state"
[ T1359] BUG: Bad rss-counter state mm:00000000e5421690 type:MM_SHMEMPAGES val:27
[T24203] page: refcount:542376 mapcount:542374 mapping:00000000ba179a51 index:0x0 pfn:0x29594e
[T24203] memcg:ffff9bc0424b6800
[T24203] aops:shmem_aops ino:a678
[T24203] flags: 0x400000000004012d(locked|referenced|uptodate|lru|active|swapbacked|zone=2)
[T24203] raw: 400000000004012d ffffd864c574a908 ffffd864c776fa48 ffff9bc1a0681040
[T24203] raw: 0000000000000000 0000000000000000 000846a8000846a5 ffff9bc0424b6800
[T24203] page dumped because: VM_BUG_ON_FOLIO(folio_mapped(folio))
[T24203] ------------[ cut here ]------------
[T24203] kernel BUG at mm/filemap.c:162!
[T24203] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[T24203] CPU: 1 UID: 0 PID: 24203 Comm: stress-ng Not tainted 6.10.0-rc6-next-20240703-00016-g09a756327684 #1417
[T24203] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021
[T24203] RIP: 0010:filemap_unaccount_folio+0xcf/0x170
[T24203] Code: 00 00 48 8b 06 a8 40 0f 84 a3 00 00 00 8b 43 50 83 c0 01 85 c0 0f 8e 66 ff ff ff 48 c7 c6 48 e4 aa bc 48 89 df e8 31 32 03 00 <0f> 0b 5b 5d 41 5c e9 71 7f 92 00 44 89 e2 be 17 00 00 00 48 89 df
[T24203] RSP: 0018:ffffb1a4e01aba88 EFLAGS: 00010046
[T24203] RAX: 0000000000000039 RBX: ffffd864ca565380 RCX: 0000000000000027
[T24203] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9bceee657780
[T24203] RBP: ffff9bc1a0681040 R08: 0000000000000000 R09: 0000000000000003
[T24203] R10: ffffb1a4e01ab940 R11: ffffffffbcc82940 R12: ffff9bc1a0681040
[T24203] R13: 0000000000000000 R14: ffff9bc1a0681048 R15: ffffd864ca565380
[T24203] FS:  00007f944cb0df40(0000) GS:ffff9bceee640000(0000) knlGS:0000000000000000
[T24203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[T24203] CR2: 00007ffe0ec9f948 CR3: 0000000150be4000 CR4: 0000000000750ef0
[T24203] PKRU: 55555554
[T24203] Call Trace:
[T24203]  <TASK>
[T24203]  ? die+0x31/0x80
[T24203]  ? do_trap+0xf1/0x100
[T24203]  ? filemap_unaccount_folio+0xcf/0x170
[T24203]  ? do_error_trap+0x60/0x80
[T24203]  ? filemap_unaccount_folio+0xcf/0x170
[T24203]  ? exc_invalid_op+0x4d/0x70
[T24203]  ? filemap_unaccount_folio+0xcf/0x170
[T24203]  ? asm_exc_invalid_op+0x1a/0x20
[T24203]  ? filemap_unaccount_folio+0xcf/0x170
[T24203]  ? filemap_unaccount_folio+0xcf/0x170
[T24203]  ? __filemap_remove_folio+0x33/0x1a0
[T24203]  ? xas_find+0x159/0x1c0
[T24203]  ? srso_alias_return_thunk+0x5/0xfbef5
[T24203]  ? find_lock_entries+0x229/0x330
[T24203]  ? srso_alias_return_thunk+0x5/0xfbef5
[T24203]  ? unmap_mapping_folio+0x75/0x130
[T24203]  ? filemap_remove_folio+0x3c/0xa0
[T24203]  ? truncate_inode_folio+0x1e/0x30
[T24203]  ? shmem_undo_range+0x15c/0x6f0
[T24203]  ? bio_integrity_unpin_bvec+0xf/0x60
[T24203]  ? shmem_evict_inode+0x109/0x260
[T24203]  ? swake_up_locked+0x50/0x50
[T24203]  ? evict+0xbf/0x1c0
[T24203]  ? __dentry_kill+0x6c/0x170
[T24203]  ? dput+0xe6/0x1b0
[T24203]  ? __fput+0x13c/0x2c0
[T24203]  ? task_work_run+0x57/0x80
[T24203]  ? syscall_exit_to_user_mode+0x196/0x1a0
[T24203]  ? do_syscall_64+0x6b/0x170
[T24203]  ? entry_SYSCALL_64_after_hwframe+0x55/0x5d
[T24203]  </TASK>
[T24203] Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device rfcomm cmac bnep nls_ascii nls_cp437 vfat fat snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_scodec_component btusb btrtl btintel snd_hda_intel btbcm snd_intel_dspcfg btmtk snd_hda_codec snd_soc_dmic snd_acp3x_rn uvcvideo snd_acp3x_pdm_dma bluetooth amd_atl snd_hwdep snd_soc_core videobuf2_vmalloc snd_hda_core uvc videobuf2_memops videobuf2_v4l2 snd_pcm_oss videodev snd_mixer_oss snd_pcm snd_rn_pci_acp3x snd_acp_config videobuf2_common snd_timer msi_wmi snd_soc_acpi ecdh_generic ecc mc sparse_keymap snd edac_mce_amd wmi_bmof ccp soundcore snd_pci_acp3x k10temp ac battery button hid_sensor_gyro_3d hid_sensor_als hid_sensor_magn_3d hid_sensor_accel_3d hid_sensor_prox joydev hid_sensor_trigger industrialio_triggered_buffer kfifo_buf industrialio amd_pmc hid_sensor_iio_common evdev hid_multitouch serio_raw mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76
[T24203]  mac80211 libarc4 cfg80211 rfkill msr fuse nvme_fabrics efi_pstore configfs efivarfs autofs4 ext4 crc32c_generic mbcache jbd2 usbhid amdgpu i2c_algo_bit drm_ttm_helper ttm xhci_pci drm_exec drm_suballoc_helper xhci_hcd amdxcp drm_buddy hid_sensor_hub usbcore gpu_sched nvme mfd_core hid_generic crc32c_intel psmouse amd_sfh i2c_piix4 drm_display_helper usb_common nvme_core r8169 crc16 i2c_hid_acpi i2c_hid hid i2c_designware_platform i2c_designware_core
[T24203] ---[ end trace 0000000000000000 ]---
[T24203] RIP: 0010:filemap_unaccount_folio+0xcf/0x170
[T24203] Code: 00 00 48 8b 06 a8 40 0f 84 a3 00 00 00 8b 43 50 83 c0 01 85 c0 0f 8e 66 ff ff ff 48 c7 c6 48 e4 aa bc 48 89 df e8 31 32 03 00 <0f> 0b 5b 5d 41 5c e9 71 7f 92 00 44 89 e2 be 17 00 00 00 48 89 df
[T24203] RSP: 0018:ffffb1a4e01aba88 EFLAGS: 00010046
[T24203] RAX: 0000000000000039 RBX: ffffd864ca565380 RCX: 0000000000000027
[T24203] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9bceee657780
[T24203] RBP: ffff9bc1a0681040 R08: 0000000000000000 R09: 0000000000000003
[T24203] R10: ffffb1a4e01ab940 R11: ffffffffbcc82940 R12: ffff9bc1a0681040
[T24203] R13: 0000000000000000 R14: ffff9bc1a0681048 R15: ffffd864ca565380
[T24203] FS:  00007f944cb0df40(0000) GS:ffff9bceee640000(0000) knlGS:0000000000000000
[T24203] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[T24203] CR2: 00007ffe0ec9f948 CR3: 0000000150be4000 CR4: 0000000000750ef0
[T24203] PKRU: 55555554



* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
  2024-07-03 14:30 Bert Karwatzki
@ 2024-07-04 18:31 ` Liam R. Howlett
  0 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-04 18:31 UTC (permalink / raw)
  To: Bert Karwatzki; +Cc: Andrew Morton, linux-mm

* Bert Karwatzki <spasswolf@web.de> [240703 10:30]:
> 
> > Thanks.  Is this the first error printed in the kern.log?
> 
> Yes. Uptime was about an hour when it occurred.
> 
> > It could very likely be me, but I don't know how the count would have
> > errors so late in the process life cycle - do you have
> > CONFIG_DEBUG_VM_MAPLE_TREE enabled?  This would check the count against
> > the tree on modifications and would narrow down where things could have
> > gone wrong.
> 
> Unfortunately, the panicking kernel was compiled without CONFIG_DEBUG_VM_MAPLE_TREE.
> After the error I recompiled it with that option enabled. This kernel has been
> running for about 3 hours without the error reappearing.
> 

I sent v3 after producing an error with stress-ng --vm-segv, which
unmaps the entire process space.  That might be at play in what you
saw, but I'm still unsure if your issue is fixed by this change.  I was
unable to reproduce the issue you saw against the mm-unstable and
mm-stable branches, but that might just mean it's rare.

Thanks,
Liam


* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
@ 2024-07-03 14:30 Bert Karwatzki
  2024-07-04 18:31 ` Liam R. Howlett
  0 siblings, 1 reply; 31+ messages in thread
From: Bert Karwatzki @ 2024-07-03 14:30 UTC (permalink / raw)
  To: Liam R . Howlett; +Cc: Bert Karwatzki, Andrew Morton, linux-mm


> Thanks.  Is this the first error printed in the kern.log?

Yes. Uptime was about an hour when it occurred.

> It could very likely be me, but I don't know how the count would have
> errors so late in the process life cycle - do you have
> CONFIG_DEBUG_VM_MAPLE_TREE enabled?  This would check the count against
> the tree on modifications and would narrow down where things could have
> gone wrong.

Unfortunately, the panicking kernel was compiled without CONFIG_DEBUG_VM_MAPLE_TREE.
After the error I recompiled it with that option enabled. This kernel has been
running for about 3 hours without the error reappearing.

Bert Karwatzki



* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
  2024-07-03 10:13 Bert Karwatzki
@ 2024-07-03 13:53 ` Liam R. Howlett
  0 siblings, 0 replies; 31+ messages in thread
From: Liam R. Howlett @ 2024-07-03 13:53 UTC (permalink / raw)
  To: Bert Karwatzki; +Cc: Andrew Morton, linux-mm

* Bert Karwatzki <spasswolf@web.de> [240703 06:14]:
> I've been running the fixed patchset on top of linux-next for a few days now, so far
> without error, but then I ran into this. After this the system went into a kernel panic
> (freeze, flashing capslock); this is the last message preserved in /var/log/kern.log. I
> tried an emergency sync using magic SysRq, but that did not work, so the actual panic
> message is lost. I do not know how (or if) this is related to the patch set.

Thanks.  Is this the first error printed in the kern.log?

> The kernel used is linux-next-20240703 plus your v2 patchset plus one additional unrelated patch
> (just #ifdef CONFIG_OF in drivers/pci/bus.c related to 
> https://lore.kernel.org/all/20240612082019.19161-4-brgl@bgdev.pl/#t).


It could very likely be me, but I don't know how the count would have
errors so late in the process life cycle - do you have
CONFIG_DEBUG_VM_MAPLE_TREE enabled?  This would check the count against
the tree on modifications and would narrow down where things could have
gone wrong.


> 
> 
> [ T8516] show_signal_msg: 16 callbacks suppressed
> [ T8516] Isolated Web Co[8516]: segfault at 0 ip 00007f8c1f55fbe5 sp 00007ffcc2b97660 error 6 in libxul.so[4f98be5,7f8c1a686000+5f96000] likely on CPU 14 (core 7, socket 0)

This is firefox again.

> [ T8516] Code: 48 8d 0d 63 a3 3c 01 48 89 08 c7 04 25 00 00 00 00 00 00 00 00 0f 0b 48 8b 05 47 1a e2 03 48 8d 0d 38 99 30 01 48 89 08 31 c0 <89> 04 25 00 00 00 00 0f 0b e8 7d 7a 12 fb 66 2e 0f 1f 84 00 00 00
> [ T8521] ------------[ cut here ]------------
> [ T8521] kernel BUG at mm/mmap.c:3521!

This is failing because the munmap count != mm->map_count on
exit_mmap().
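
For context, the check that fires there is of this shape (paraphrased
from exit_mmap(), not an exact quote):

	/* after all vmas have been torn down: */
	BUG_ON(count != mm->map_count);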

> [ T8521] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ T8521] CPU: 6 UID: 1000 PID: 8521 Comm: Socket Thread Not tainted 6.10.0-rc6-next-20240703-00016-g09a756327684 #1416
> [ T8521] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021
> [ T8521] RIP: 0010:exit_mmap+0x28c/0x2a0
> [ T8521] Code: f7 45 31 ed e8 f5 7e e9 ff 4c 89 f7 e8 3d f4 6c 00 e9 63 ff ff ff 48 89 ef e8 70 11 04 00 e9 d1 fd ff ff 0f 0b e9 66 ff ff ff <0f> 0b e8 8d 0a 6c 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 0f
> [ T8521] RSP: 0018:ffffac759002fca0 EFLAGS: 00010297
> [ T8521] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [ T8521] RDX: 0000000000000001 RSI: ffff8bb7caca9700 RDI: ffff8bb7caca9708
> [ T8521] RBP: ffff8bb7c237e900 R08: 0000000000000000 R09: 000000000000000f
> [ T8521] R10: 00007ffcc2b9bfff R11: 0000000000000078 R12: 00000000000006b5
> [ T8521] R13: fffffffffffeb575 R14: ffff8bb7c237e9a8 R15: ffff8bb7c237e940
> [ T8521] FS:  0000000000000000(0000) GS:ffff8bc6adf80000(0000) knlGS:0000000000000000
> [ T8521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ T8521] CR2: 00007f2b6cae2000 CR3: 0000000fd1c18000 CR4: 0000000000750ef0
> [ T8521] PKRU: 55555554
> [ T8521] Call Trace:
> [ T8521]  <TASK>
> [ T8521]  ? die+0x31/0x80
> [ T8521]  ? do_trap+0xf1/0x100
> [ T8521]  ? exit_mmap+0x28c/0x2a0
> [ T8521]  ? do_error_trap+0x60/0x80
> [ T8521]  ? exit_mmap+0x28c/0x2a0
> [ T8521]  ? exc_invalid_op+0x4d/0x70
> [ T8521]  ? exit_mmap+0x28c/0x2a0
> [ T8521]  ? asm_exc_invalid_op+0x1a/0x20
> [ T8521]  ? exit_mmap+0x28c/0x2a0
> [ T8521]  ? mmput+0x50/0x120
> [ T8521]  ? do_exit+0x285/0x9c0
> [ T8521]  ? do_group_exit+0x2b/0x80
> [ T8521]  ? get_signal+0x731/0x7e0
> [ T8521]  ? arch_do_signal_or_restart+0x29/0x230
> [ T8521]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ T8521]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ T8521]  ? __x64_sys_futex+0x109/0x1c0
> [ T8521]  ? syscall_exit_to_user_mode+0x154/0x1a0
> [ T8521]  ? do_syscall_64+0x6b/0x170
> [ T8521]  ? entry_SYSCALL_64_after_hwframe+0x55/0x5d
> [ T8521]  </TASK>
> [ T8521] Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device rfcomm cmac bnep nls_ascii nls_cp437 vfat fat snd_ctl_led btusb snd_hda_codec_realtek btrtl btintel snd_hda_codec_generic btbcm btmtk snd_hda_scodec_component snd_hda_codec_hdmi bluetooth snd_hda_intel snd_intel_dspcfg uvcvideo snd_hda_codec videobuf2_vmalloc snd_acp3x_pdm_dma snd_soc_dmic snd_hwdep snd_acp3x_rn uvc amd_atl videobuf2_memops snd_soc_core snd_hda_core videobuf2_v4l2 snd_pcm_oss videodev snd_mixer_oss snd_pcm snd_rn_pci_acp3x videobuf2_common snd_acp_config snd_timer snd_soc_acpi msi_wmi ecdh_generic ecc sparse_keymap mc edac_mce_amd wmi_bmof snd ccp k10temp joydev soundcore snd_pci_acp3x battery ac button hid_sensor_gyro_3d hid_sensor_prox hid_sensor_als hid_sensor_magn_3d hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf industrialio amd_pmc hid_sensor_iio_common evdev hid_multitouch serio_raw mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76
> [ T8521]  mac80211 libarc4 cfg80211 rfkill msr fuse nvme_fabrics efi_pstore configfs efivarfs autofs4 ext4 crc32c_generic mbcache jbd2 usbhid amdgpu i2c_algo_bit drm_ttm_helper xhci_pci ttm drm_exec drm_suballoc_helper xhci_hcd amdxcp drm_buddy hid_sensor_hub usbcore nvme gpu_sched mfd_core hid_generic crc32c_intel psmouse i2c_piix4 amd_sfh drm_display_helper usb_common nvme_core crc16 r8169 i2c_hid_acpi i2c_hid hid i2c_designware_platform i2c_designware_core
> [ T8521] ---[ end trace 0000000000000000 ]---
> [ T8521] RIP: 0010:exit_mmap+0x28c/0x2a0
> [ T8521] Code: f7 45 31 ed e8 f5 7e e9 ff 4c 89 f7 e8 3d f4 6c 00 e9 63 ff ff ff 48 89 ef e8 70 11 04 00 e9 d1 fd ff ff 0f 0b e9 66 ff ff ff <0f> 0b e8 8d 0a 6c 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 0f
> [ T8521] RSP: 0018:ffffac759002fca0 EFLAGS: 00010297
> [ T8521] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [ T8521] RDX: 0000000000000001 RSI: ffff8bb7caca9700 RDI: ffff8bb7caca9708
> [ T8521] RBP: ffff8bb7c237e900 R08: 0000000000000000 R09: 000000000000000f
> [ T8521] R10: 00007ffcc2b9bfff R11: 0000000000000078 R12: 00000000000006b5
> [ T8521] R13: fffffffffffeb575 R14: ffff8bb7c237e9a8 R15: ffff8bb7c237e940
> [ T8521] FS:  0000000000000000(0000) GS:ffff8bc6adfc0000(0000) knlGS:0000000000000000
> [ T8521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ T8521] CR2: 00007f460adce000 CR3: 0000000fd1c18000 CR4: 0000000000750ef0
> [ T8521] PKRU: 55555554
> [ T8521] Fixing recursive fault but reboot is needed!
> [ T8521] BUG: scheduling while atomic: Socket Thread/8521/0x00000000
> [ T8521] Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device rfcomm cmac bnep nls_ascii nls_cp437 vfat fat snd_ctl_led btusb snd_hda_codec_realtek btrtl btintel snd_hda_codec_generic btbcm btmtk snd_hda_scodec_component snd_hda_codec_hdmi bluetooth snd_hda_intel snd_intel_dspcfg uvcvideo snd_hda_codec videobuf2_vmalloc snd_acp3x_pdm_dma snd_soc_dmic snd_hwdep snd_acp3x_rn uvc amd_atl videobuf2_memops snd_soc_core snd_hda_core videobuf2_v4l2 snd_pcm_oss videodev snd_mixer_oss snd_pcm snd_rn_pci_acp3x videobuf2_common snd_acp_config snd_timer snd_soc_acpi msi_wmi ecdh_generic ecc sparse_keymap mc edac_mce_amd wmi_bmof snd ccp k10temp joydev soundcore snd_pci_acp3x battery ac button hid_sensor_gyro_3d hid_sensor_prox hid_sensor_als hid_sensor_magn_3d hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf industrialio amd_pmc hid_sensor_iio_common evdev hid_multitouch serio_raw mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76
> [ T8521]  mac80211 libarc4 cfg80211 rfkill msr fuse nvme_fabrics efi_pstore configfs efivarfs autofs4 ext4 crc32c_generic mbcache jbd2 usbhid amdgpu i2c_algo_bit drm_ttm_helper xhci_pci ttm drm_exec drm_suballoc_helper xhci_hcd amdxcp drm_buddy hid_sensor_hub usbcore nvme gpu_sched mfd_core hid_generic crc32c_intel psmouse i2c_piix4 amd_sfh drm_display_helper usb_common nvme_core crc16 r8169 i2c_hid_acpi i2c_hid hid i2c_designware_platform i2c_designware_core
> [ T8521] CPU: 7 UID: 1000 PID: 8521 Comm: Socket Thread Tainted: G      D            6.10.0-rc6-next-20240703-00016-g09a756327684 #1416
> [ T8521] Tainted: [D]=DIE
> [ T8521] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021
> [ T8521] Call Trace:
> [ T8521]  <TASK>
> [ T8521]  ? dump_stack_lvl+0x53/0x70
> [ T8521]  ? __schedule_bug+0x4d/0x60
> [ T8521]  ? __schedule+0x734/0x800
> [ T8521]  ? srso_alias_return_thunk+0x5/0xfbef5
> [ T8521]  ? _printk+0x57/0x80
> [ T8521]  ? do_task_dead+0x3d/0x40
> [ T8521]  ? make_task_dead+0x13b/0x160
> [ T8521]  ? rewind_stack_and_make_dead+0x16/0x20
> [ T8521]  </TASK>

It looks like it might have been in the process of killing the thread
group (terminating firefox ?) for another reason?

Considering that I modify the count under the mmap_lock in write mode
and that the map_count is verified in validate_mm(), I am eager to find
out if you are running with CONFIG_DEBUG_VM_MAPLE_TREE enabled.

Thanks,
Liam


* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
@ 2024-07-03 10:13 Bert Karwatzki
  2024-07-03 13:53 ` Liam R. Howlett
  0 siblings, 1 reply; 31+ messages in thread
From: Bert Karwatzki @ 2024-07-03 10:13 UTC (permalink / raw)
  To: Liam R . Howlett; +Cc: Bert Karwatzki, Andrew Morton, linux-mm

I've been running the fixed patchset on top of linux-next for a few days now, so far
without error, but then I ran into this. After this the system went into a kernel panic
(freeze, flashing capslock); this is the last message preserved in /var/log/kern.log. I
tried an emergency sync using magic SysRq, but that did not work, so the actual panic
message is lost. I do not know how (or if) this is related to the patch set.
The kernel used is linux-next-20240703 plus your v2 patchset plus one additional unrelated patch
(just #ifdef CONFIG_OF in drivers/pci/bus.c related to 
https://lore.kernel.org/all/20240612082019.19161-4-brgl@bgdev.pl/#t).


[ T8516] show_signal_msg: 16 callbacks suppressed
[ T8516] Isolated Web Co[8516]: segfault at 0 ip 00007f8c1f55fbe5 sp 00007ffcc2b97660 error 6 in libxul.so[4f98be5,7f8c1a686000+5f96000] likely on CPU 14 (core 7, socket 0)
[ T8516] Code: 48 8d 0d 63 a3 3c 01 48 89 08 c7 04 25 00 00 00 00 00 00 00 00 0f 0b 48 8b 05 47 1a e2 03 48 8d 0d 38 99 30 01 48 89 08 31 c0 <89> 04 25 00 00 00 00 0f 0b e8 7d 7a 12 fb 66 2e 0f 1f 84 00 00 00
[ T8521] ------------[ cut here ]------------
[ T8521] kernel BUG at mm/mmap.c:3521!
[ T8521] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ T8521] CPU: 6 UID: 1000 PID: 8521 Comm: Socket Thread Not tainted 6.10.0-rc6-next-20240703-00016-g09a756327684 #1416
[ T8521] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021
[ T8521] RIP: 0010:exit_mmap+0x28c/0x2a0
[ T8521] Code: f7 45 31 ed e8 f5 7e e9 ff 4c 89 f7 e8 3d f4 6c 00 e9 63 ff ff ff 48 89 ef e8 70 11 04 00 e9 d1 fd ff ff 0f 0b e9 66 ff ff ff <0f> 0b e8 8d 0a 6c 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 0f
[ T8521] RSP: 0018:ffffac759002fca0 EFLAGS: 00010297
[ T8521] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ T8521] RDX: 0000000000000001 RSI: ffff8bb7caca9700 RDI: ffff8bb7caca9708
[ T8521] RBP: ffff8bb7c237e900 R08: 0000000000000000 R09: 000000000000000f
[ T8521] R10: 00007ffcc2b9bfff R11: 0000000000000078 R12: 00000000000006b5
[ T8521] R13: fffffffffffeb575 R14: ffff8bb7c237e9a8 R15: ffff8bb7c237e940
[ T8521] FS:  0000000000000000(0000) GS:ffff8bc6adf80000(0000) knlGS:0000000000000000
[ T8521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T8521] CR2: 00007f2b6cae2000 CR3: 0000000fd1c18000 CR4: 0000000000750ef0
[ T8521] PKRU: 55555554
[ T8521] Call Trace:
[ T8521]  <TASK>
[ T8521]  ? die+0x31/0x80
[ T8521]  ? do_trap+0xf1/0x100
[ T8521]  ? exit_mmap+0x28c/0x2a0
[ T8521]  ? do_error_trap+0x60/0x80
[ T8521]  ? exit_mmap+0x28c/0x2a0
[ T8521]  ? exc_invalid_op+0x4d/0x70
[ T8521]  ? exit_mmap+0x28c/0x2a0
[ T8521]  ? asm_exc_invalid_op+0x1a/0x20
[ T8521]  ? exit_mmap+0x28c/0x2a0
[ T8521]  ? mmput+0x50/0x120
[ T8521]  ? do_exit+0x285/0x9c0
[ T8521]  ? do_group_exit+0x2b/0x80
[ T8521]  ? get_signal+0x731/0x7e0
[ T8521]  ? arch_do_signal_or_restart+0x29/0x230
[ T8521]  ? srso_alias_return_thunk+0x5/0xfbef5
[ T8521]  ? srso_alias_return_thunk+0x5/0xfbef5
[ T8521]  ? __x64_sys_futex+0x109/0x1c0
[ T8521]  ? syscall_exit_to_user_mode+0x154/0x1a0
[ T8521]  ? do_syscall_64+0x6b/0x170
[ T8521]  ? entry_SYSCALL_64_after_hwframe+0x55/0x5d
[ T8521]  </TASK>
[ T8521] Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device rfcomm cmac bnep nls_ascii nls_cp437 vfat fat snd_ctl_led btusb snd_hda_codec_realtek btrtl btintel snd_hda_codec_generic btbcm btmtk snd_hda_scodec_component snd_hda_codec_hdmi bluetooth snd_hda_intel snd_intel_dspcfg uvcvideo snd_hda_codec videobuf2_vmalloc snd_acp3x_pdm_dma snd_soc_dmic snd_hwdep snd_acp3x_rn uvc amd_atl videobuf2_memops snd_soc_core snd_hda_core videobuf2_v4l2 snd_pcm_oss videodev snd_mixer_oss snd_pcm snd_rn_pci_acp3x videobuf2_common snd_acp_config snd_timer snd_soc_acpi msi_wmi ecdh_generic ecc sparse_keymap mc edac_mce_amd wmi_bmof snd ccp k10temp joydev soundcore snd_pci_acp3x battery ac button hid_sensor_gyro_3d hid_sensor_prox hid_sensor_als hid_sensor_magn_3d hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf industrialio amd_pmc hid_sensor_iio_common evdev hid_multitouch serio_raw mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76
[ T8521]  mac80211 libarc4 cfg80211 rfkill msr fuse nvme_fabrics efi_pstore configfs efivarfs autofs4 ext4 crc32c_generic mbcache jbd2 usbhid amdgpu i2c_algo_bit drm_ttm_helper xhci_pci ttm drm_exec drm_suballoc_helper xhci_hcd amdxcp drm_buddy hid_sensor_hub usbcore nvme gpu_sched mfd_core hid_generic crc32c_intel psmouse i2c_piix4 amd_sfh drm_display_helper usb_common nvme_core crc16 r8169 i2c_hid_acpi i2c_hid hid i2c_designware_platform i2c_designware_core
[ T8521] ---[ end trace 0000000000000000 ]---
[ T8521] RIP: 0010:exit_mmap+0x28c/0x2a0
[ T8521] Code: f7 45 31 ed e8 f5 7e e9 ff 4c 89 f7 e8 3d f4 6c 00 e9 63 ff ff ff 48 89 ef e8 70 11 04 00 e9 d1 fd ff ff 0f 0b e9 66 ff ff ff <0f> 0b e8 8d 0a 6c 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 66 0f
[ T8521] RSP: 0018:ffffac759002fca0 EFLAGS: 00010297
[ T8521] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ T8521] RDX: 0000000000000001 RSI: ffff8bb7caca9700 RDI: ffff8bb7caca9708
[ T8521] RBP: ffff8bb7c237e900 R08: 0000000000000000 R09: 000000000000000f
[ T8521] R10: 00007ffcc2b9bfff R11: 0000000000000078 R12: 00000000000006b5
[ T8521] R13: fffffffffffeb575 R14: ffff8bb7c237e9a8 R15: ffff8bb7c237e940
[ T8521] FS:  0000000000000000(0000) GS:ffff8bc6adfc0000(0000) knlGS:0000000000000000
[ T8521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ T8521] CR2: 00007f460adce000 CR3: 0000000fd1c18000 CR4: 0000000000750ef0
[ T8521] PKRU: 55555554
[ T8521] Fixing recursive fault but reboot is needed!
[ T8521] BUG: scheduling while atomic: Socket Thread/8521/0x00000000
[ T8521] Modules linked in: ccm snd_seq_dummy snd_hrtimer snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device rfcomm cmac bnep nls_ascii nls_cp437 vfat fat snd_ctl_led btusb snd_hda_codec_realtek btrtl btintel snd_hda_codec_generic btbcm btmtk snd_hda_scodec_component snd_hda_codec_hdmi bluetooth snd_hda_intel snd_intel_dspcfg uvcvideo snd_hda_codec videobuf2_vmalloc snd_acp3x_pdm_dma snd_soc_dmic snd_hwdep snd_acp3x_rn uvc amd_atl videobuf2_memops snd_soc_core snd_hda_core videobuf2_v4l2 snd_pcm_oss videodev snd_mixer_oss snd_pcm snd_rn_pci_acp3x videobuf2_common snd_acp_config snd_timer snd_soc_acpi msi_wmi ecdh_generic ecc sparse_keymap mc edac_mce_amd wmi_bmof snd ccp k10temp joydev soundcore snd_pci_acp3x battery ac button hid_sensor_gyro_3d hid_sensor_prox hid_sensor_als hid_sensor_magn_3d hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf industrialio amd_pmc hid_sensor_iio_common evdev hid_multitouch serio_raw mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76
[ T8521]  mac80211 libarc4 cfg80211 rfkill msr fuse nvme_fabrics efi_pstore configfs efivarfs autofs4 ext4 crc32c_generic mbcache jbd2 usbhid amdgpu i2c_algo_bit drm_ttm_helper xhci_pci ttm drm_exec drm_suballoc_helper xhci_hcd amdxcp drm_buddy hid_sensor_hub usbcore nvme gpu_sched mfd_core hid_generic crc32c_intel psmouse i2c_piix4 amd_sfh drm_display_helper usb_common nvme_core crc16 r8169 i2c_hid_acpi i2c_hid hid i2c_designware_platform i2c_designware_core
[ T8521] CPU: 7 UID: 1000 PID: 8521 Comm: Socket Thread Tainted: G      D            6.10.0-rc6-next-20240703-00016-g09a756327684 #1416
[ T8521] Tainted: [D]=DIE
[ T8521] Hardware name: Micro-Star International Co., Ltd. Alpha 15 B5EEK/MS-158L, BIOS E158LAMS.107 11/10/2021
[ T8521] Call Trace:
[ T8521]  <TASK>
[ T8521]  ? dump_stack_lvl+0x53/0x70
[ T8521]  ? __schedule_bug+0x4d/0x60
[ T8521]  ? __schedule+0x734/0x800
[ T8521]  ? srso_alias_return_thunk+0x5/0xfbef5
[ T8521]  ? _printk+0x57/0x80
[ T8521]  ? do_task_dead+0x3d/0x40
[ T8521]  ? make_task_dead+0x13b/0x160
[ T8521]  ? rewind_stack_and_make_dead+0x16/0x20
[ T8521]  </TASK>

Bert Karwatzki



* Re: [PATCH v2 00/15] Avoid MAP_FIXED gap exposure
@ 2024-06-27  9:31 Bert Karwatzki
  0 siblings, 0 replies; 31+ messages in thread
From: Bert Karwatzki @ 2024-06-27  9:31 UTC (permalink / raw)
  To: Liam R . Howlett; +Cc: Bert Karwatzki, Andrew Morton, linux-mm

* Andrew Morton <akpm@linux-foundation.org> [240626 16:59]:
> On Tue, 25 Jun 2024 15:11:30 -0400 "Liam R. Howlett" <Liam.Howlett@oracle.com> wrote:
>
> > It is now possible to walk the vma tree using the rcu read locks and is
> > beneficial to do so to reduce lock contention.  Doing so while a
> > MAP_FIXED mapping is executing means that a reader may see a gap in the
> > vma tree that should never logically exist - and does not when using the
> > mmap lock in read mode.  The temporal gap exists because mmap_region()
> > calls munmap() prior to installing the new mapping.
>
> What are the consequences when this race hits?  IOW, why do we need to
> change anything?
>

If I understand this correctly, the plan is to replace mmap_read_lock(mm)
with rcu_read_lock().  So the consequences of a visible gap could be tested
by replacing mmap_read_lock(mm) with rcu_read_lock() in the old code.  If
this is the case, I'm willing to test it.

Bert Karwatzki


