* [PATCH v6 01/20] mm/vma: Correctly position vma_iterator in __split_vma()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 02/20] mm/vma: Introduce abort_munmap_vmas() Liam R. Howlett
` (18 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett,
Lorenzo Stoakes
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The vma iterator may be left pointing to the newly created vma. This
happens when inserting the new vma at the end of the old vma
(!new_below).
The incorrect position in the vma iterator is not exposed currently
since the vma iterator is repositioned in the munmap path and is not
reused in any of the other paths.
This has limited impact in the current code, but is required for future
changes.
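For illustration only (not part of the change itself), a sketch of the two
split layouts and of the invariant the repositioning establishes; the
surrounding caller and its error handling below are hypothetical:

	/*
	 * Layouts produced by __split_vma(vmi, vma, addr, new_below):
	 *
	 *   new_below != 0:  | new: [vm_start, addr) | vma: [addr, vm_end) |
	 *                    vma_next() moves the iterator onto vma
	 *
	 *   new_below == 0:  | vma: [vm_start, addr) | new: [addr, vm_end) |
	 *                    vma_prev() moves the iterator onto vma
	 */
	error = __split_vma(vmi, vma, addr, new_below);
	if (error)
		return error;
	/* The iterator now references the original, trimmed vma either way. */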
Fixes: b2b3b886738f ("mm: don't use __vma_adjust() in __split_vma()")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
mm/vma.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/vma.c b/mm/vma.c
index 5850f7c0949b..066de79b7b73 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -177,7 +177,7 @@ void unmap_region(struct mm_struct *mm, struct ma_state *mas,
/*
* __split_vma() bypasses sysctl_max_map_count checking. We use this where it
* has already been checked or doesn't make sense to fail.
- * VMA Iterator will point to the end VMA.
+ * VMA Iterator will point to the original VMA.
*/
static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long addr, int new_below)
@@ -246,6 +246,9 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* Success. */
if (new_below)
vma_next(vmi);
+ else
+ vma_prev(vmi);
+
return 0;
out_free_mpol:
--
2.43.0
* [PATCH v6 02/20] mm/vma: Introduce abort_munmap_vmas()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 01/20] mm/vma: Correctly position vma_iterator in __split_vma() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 03/20] mm/vma: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
` (17 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett,
Liam R . Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Extract the cleanup of failed munmap() operations from
do_vmi_align_munmap(). This simplifies later patches in the series.
It is worth noting that the mas_for_each() loop now has a different
upper limit. This should not change the number of vmas visited for
reattaching to the main vma tree (mm_mt), as all vmas are reattached in
both scenarios.
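For reference, a condensed view (illustration only) of the reattach loop and
why the new limit visits the same vmas: the detached tree stores its entries
at small indices (0, 1, 2, ...), below either limit.

	/* Entries live at indices 0 .. count - 1 in mas_detach->tree. */
	mas_set(mas_detach, 0);
	mas_for_each(mas_detach, vma, ULONG_MAX)	/* was: ..., end) */
		vma_mark_detached(vma, false);

	__mt_destroy(mas_detach->tree);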
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/vma.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index 066de79b7b73..58ecd447670d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -668,6 +668,22 @@ void vma_complete(struct vma_prepare *vp,
validate_mm(mm);
}
+/*
+ * abort_munmap_vmas - Undo any munmap work and free resources
+ *
+ * Reattach any detached vmas and free up the maple tree used to track the vmas.
+ */
+static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+{
+ struct vm_area_struct *vma;
+
+ mas_set(mas_detach, 0);
+ mas_for_each(mas_detach, vma, ULONG_MAX)
+ vma_mark_detached(vma, false);
+
+ __mt_destroy(mas_detach->tree);
+}
+
/*
* do_vmi_align_munmap() - munmap the aligned region from @start to @end.
* @vmi: The vma iterator
@@ -834,11 +850,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
userfaultfd_error:
munmap_gather_failed:
end_split_failed:
- mas_set(&mas_detach, 0);
- mas_for_each(&mas_detach, next, end)
- vma_mark_detached(next, false);
-
- __mt_destroy(&mt_detach);
+ abort_munmap_vmas(&mas_detach);
start_split_failed:
map_count_exceeded:
validate_mm(mm);
--
2.43.0
* [PATCH v6 03/20] mm/vma: Introduce vmi_complete_munmap_vmas()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 01/20] mm/vma: Correctly position vma_iterator in __split_vma() Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 02/20] mm/vma: Introduce abort_munmap_vmas() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 04/20] mm/vma: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
` (16 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett,
Liam R . Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Extract all the operations that need to be completed once the vma maple tree
has been updated during a munmap() operation into a new helper. Extracting
this makes later patches in the series easier to understand.
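Condensed from the hunks below, the helper takes over everything that
previously followed the vma tree update:

	/* Point of no return: the vma tree no longer contains the range. */
	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock,
				 &mas_detach, locked_vm);
	/*
	 * i.e. adjust map_count/locked_vm, optionally downgrade mmap_lock,
	 * unmap_region(), account statistics, free the vmas and destroy the
	 * detached tree.
	 */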
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/vma.c | 80 ++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 55 insertions(+), 25 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index 58ecd447670d..3a2098464b8f 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -684,6 +684,58 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
__mt_destroy(mas_detach->tree);
}
+/*
+ * vmi_complete_munmap_vmas() - Finish the munmap() operation
+ * @vmi: The vma iterator
+ * @vma: The first vma to be munmapped
+ * @mm: The mm struct
+ * @start: The start address
+ * @end: The end address
+ * @unlock: Unlock the mm or not
+ * @mas_detach: them maple state of the detached vma maple tree
+ * @locked_vm: The locked_vm count in the detached vmas
+ *
+ * This function updates the mm_struct, unmaps the region, frees the resources
+ * used for the munmap() and may downgrade the lock - if requested. Everything
+ * needed to be done once the vma maple tree is updated.
+ */
+static void
+vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start, unsigned long end,
+ bool unlock, struct ma_state *mas_detach,
+ unsigned long locked_vm)
+{
+ struct vm_area_struct *prev, *next;
+ int count;
+
+ count = mas_detach->index + 1;
+ mm->map_count -= count;
+ mm->locked_vm -= locked_vm;
+ if (unlock)
+ mmap_write_downgrade(mm);
+
+ prev = vma_iter_prev_range(vmi);
+ next = vma_next(vmi);
+ if (next)
+ vma_iter_prev_range(vmi);
+
+ /*
+ * We can free page tables without write-locking mmap_lock because VMAs
+ * were isolated before we downgraded mmap_lock.
+ */
+ mas_set(mas_detach, 1);
+ unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
+ !unlock);
+ /* Statistics and freeing VMAs */
+ mas_set(mas_detach, 0);
+ remove_mt(mm, mas_detach);
+ validate_mm(mm);
+ if (unlock)
+ mmap_read_unlock(mm);
+
+ __mt_destroy(mas_detach->tree);
+}
+
/*
* do_vmi_align_munmap() - munmap the aligned region from @start to @end.
* @vmi: The vma iterator
@@ -703,7 +755,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
unsigned long end, struct list_head *uf, bool unlock)
{
- struct vm_area_struct *prev, *next = NULL;
+ struct vm_area_struct *next = NULL;
struct maple_tree mt_detach;
int count = 0;
int error = -ENOMEM;
@@ -818,31 +870,9 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto clear_tree_failed;
/* Point of no return */
- mm->locked_vm -= locked_vm;
- mm->map_count -= count;
- if (unlock)
- mmap_write_downgrade(mm);
-
- prev = vma_iter_prev_range(vmi);
- next = vma_next(vmi);
- if (next)
- vma_iter_prev_range(vmi);
-
- /*
- * We can free page tables without write-locking mmap_lock because VMAs
- * were isolated before we downgraded mmap_lock.
- */
- mas_set(&mas_detach, 1);
- unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
- !unlock);
- /* Statistics and freeing VMAs */
- mas_set(&mas_detach, 0);
- remove_mt(mm, &mas_detach);
- validate_mm(mm);
- if (unlock)
- mmap_read_unlock(mm);
+ vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
+ locked_vm);
- __mt_destroy(&mt_detach);
return 0;
modify_vma_failed:
--
2.43.0
* [PATCH v6 04/20] mm/vma: Extract the gathering of vmas from do_vmi_align_munmap()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (2 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 03/20] mm/vma: Introduce vmi_complete_munmap_vmas() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 05/20] mm/vma: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
` (15 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett,
Liam R . Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Create vmi_gather_munmap_vmas() to handle the gathering of vmas into a
detached maple tree for removal later. Part of the gathering is the
splitting of vmas that span the boundary.
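With the gathering split out, do_vmi_align_munmap() reduces to the three steps
below (condensed from the last hunk of this patch):

	/* 1) Gather the affected vmas into the detached tree. */
	error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
				       &mas_detach, &locked_vm);
	if (error)
		goto gather_failed;

	/* 2) Clear the range from the vma tree. */
	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
	if (error)
		goto clear_tree_failed;

	/* 3) Point of no return: finish the munmap. */
	vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock,
				 &mas_detach, locked_vm);
	return 0;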
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma.c | 82 +++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 58 insertions(+), 24 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index 3a2098464b8f..da489063b2de 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -737,32 +737,30 @@ vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
}
/*
- * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * for removal at a later date. Handles splitting first and last if necessary
+ * and marking the vmas as isolated.
+ *
* @vmi: The vma iterator
* @vma: The starting vm_area_struct
* @mm: The mm_struct
* @start: The aligned start address to munmap.
* @end: The aligned end address to munmap.
* @uf: The userfaultfd list_head
- * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
- * success.
+ * @mas_detach: The maple state tracking the detached tree
+ * @locked_vm: a pointer to store the VM_LOCKED pages count.
*
- * Return: 0 on success and drops the lock if so directed, error and leaves the
- * lock held otherwise.
+ * Return: 0 on success
*/
-int
-do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+static int
+vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
- unsigned long end, struct list_head *uf, bool unlock)
+ unsigned long end, struct list_head *uf,
+ struct ma_state *mas_detach, unsigned long *locked_vm)
{
struct vm_area_struct *next = NULL;
- struct maple_tree mt_detach;
int count = 0;
int error = -ENOMEM;
- unsigned long locked_vm = 0;
- MA_STATE(mas_detach, &mt_detach, 0, 0);
- mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
- mt_on_stack(mt_detach);
/*
* If we need to split any vma, do it now to save pain later.
@@ -812,15 +810,15 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto end_split_failed;
}
vma_start_write(next);
- mas_set(&mas_detach, count);
- error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
+ mas_set(mas_detach, count++);
+ error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
goto munmap_gather_failed;
+
vma_mark_detached(next, true);
if (next->vm_flags & VM_LOCKED)
- locked_vm += vma_pages(next);
+ *locked_vm += vma_pages(next);
- count++;
if (unlikely(uf)) {
/*
* If userfaultfd_unmap_prep returns an error the vmas
@@ -845,7 +843,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
{
- MA_STATE(test, &mt_detach, 0, 0);
+ MA_STATE(test, mas_detach->tree, 0, 0);
struct vm_area_struct *vma_mas, *vma_test;
int test_count = 0;
@@ -865,6 +863,47 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
while (vma_iter_addr(vmi) > start)
vma_iter_prev_range(vmi);
+ return 0;
+
+userfaultfd_error:
+munmap_gather_failed:
+end_split_failed:
+ abort_munmap_vmas(mas_detach);
+start_split_failed:
+map_count_exceeded:
+ return error;
+}
+
+/*
+ * do_vmi_align_munmap() - munmap the aligned region from @start to @end.
+ * @vmi: The vma iterator
+ * @vma: The starting vm_area_struct
+ * @mm: The mm_struct
+ * @start: The aligned start address to munmap.
+ * @end: The aligned end address to munmap.
+ * @uf: The userfaultfd list_head
+ * @unlock: Set to true to drop the mmap_lock. unlocking only happens on
+ * success.
+ *
+ * Return: 0 on success and drops the lock if so directed, error and leaves the
+ * lock held otherwise.
+ */
+int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start, unsigned long end,
+ struct list_head *uf, bool unlock)
+{
+ struct maple_tree mt_detach;
+ MA_STATE(mas_detach, &mt_detach, 0, 0);
+ mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+ mt_on_stack(mt_detach);
+ int error;
+ unsigned long locked_vm = 0;
+
+ error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
+ &mas_detach, &locked_vm);
+ if (error)
+ goto gather_failed;
+
error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
if (error)
goto clear_tree_failed;
@@ -872,17 +911,12 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* Point of no return */
vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
locked_vm);
-
return 0;
modify_vma_failed:
clear_tree_failed:
-userfaultfd_error:
-munmap_gather_failed:
-end_split_failed:
abort_munmap_vmas(&mas_detach);
-start_split_failed:
-map_count_exceeded:
+gather_failed:
validate_mm(mm);
return error;
}
--
2.43.0
* [PATCH v6 05/20] mm/vma: Introduce vma_munmap_struct for use in munmap operations
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (3 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 04/20] mm/vma: Extract the gathering of vmas from do_vmi_align_munmap() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 06/20] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
` (14 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett,
Liam R . Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Use a structure to pass along all the necessary information and counters
involved in removing vmas from the mm_struct.
Update vmi_ function names to vms_ to indicate the first argument
type change.
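For illustration, the resulting calling convention: the struct is filled once
and then handed to the vms_*() helpers (the surrounding error labels are the
caller's):

	struct vma_munmap_struct vms;
	int error;

	init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
	error = vms_gather_munmap_vmas(&vms, &mas_detach);
	if (error)
		goto gather_failed;
	/* ... vma tree update ... */
	vms_complete_munmap_vmas(&vms, &mas_detach);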
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma.c | 142 +++++++++++++++++++++++++++++--------------------------
mm/vma.h | 16 +++++++
2 files changed, 91 insertions(+), 67 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index da489063b2de..e1aee43a3dc4 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -80,6 +80,32 @@ static void init_multi_vma_prep(struct vma_prepare *vp,
}
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation. Only unlocked on success
+ */
+static inline void init_vma_munmap(struct vma_munmap_struct *vms,
+ struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, struct list_head *uf,
+ bool unlock)
+{
+ vms->vmi = vmi;
+ vms->vma = vma;
+ vms->mm = vma->vm_mm;
+ vms->start = start;
+ vms->end = end;
+ vms->unlock = unlock;
+ vms->uf = uf;
+ vms->vma_count = 0;
+ vms->nr_pages = vms->locked_vm = 0;
+}
+
/*
* Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff)
* in front of (at a lower virtual address and file offset than) the vma.
@@ -685,81 +711,62 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
}
/*
- * vmi_complete_munmap_vmas() - Finish the munmap() operation
- * @vmi: The vma iterator
- * @vma: The first vma to be munmapped
- * @mm: The mm struct
- * @start: The start address
- * @end: The end address
- * @unlock: Unlock the mm or not
- * @mas_detach: them maple state of the detached vma maple tree
- * @locked_vm: The locked_vm count in the detached vmas
+ * vms_complete_munmap_vmas() - Finish the munmap() operation
+ * @vms: The vma munmap struct
+ * @mas_detach: The maple state of the detached vmas
*
- * This function updates the mm_struct, unmaps the region, frees the resources
+ * This updates the mm_struct, unmaps the region, frees the resources
* used for the munmap() and may downgrade the lock - if requested. Everything
* needed to be done once the vma maple tree is updated.
*/
-static void
-vmi_complete_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
- struct mm_struct *mm, unsigned long start, unsigned long end,
- bool unlock, struct ma_state *mas_detach,
- unsigned long locked_vm)
+static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach)
{
struct vm_area_struct *prev, *next;
- int count;
+ struct mm_struct *mm;
- count = mas_detach->index + 1;
- mm->map_count -= count;
- mm->locked_vm -= locked_vm;
- if (unlock)
+ mm = vms->mm;
+ mm->map_count -= vms->vma_count;
+ mm->locked_vm -= vms->locked_vm;
+ if (vms->unlock)
mmap_write_downgrade(mm);
- prev = vma_iter_prev_range(vmi);
- next = vma_next(vmi);
+ prev = vma_iter_prev_range(vms->vmi);
+ next = vma_next(vms->vmi);
if (next)
- vma_iter_prev_range(vmi);
+ vma_iter_prev_range(vms->vmi);
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
*/
mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vma, prev, next, start, end, count,
- !unlock);
+ unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
+ vms->vma_count, !vms->unlock);
/* Statistics and freeing VMAs */
mas_set(mas_detach, 0);
remove_mt(mm, mas_detach);
validate_mm(mm);
- if (unlock)
+ if (vms->unlock)
mmap_read_unlock(mm);
__mt_destroy(mas_detach->tree);
}
/*
- * vmi_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
+ * vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
* for removal at a later date. Handles splitting first and last if necessary
* and marking the vmas as isolated.
*
- * @vmi: The vma iterator
- * @vma: The starting vm_area_struct
- * @mm: The mm_struct
- * @start: The aligned start address to munmap.
- * @end: The aligned end address to munmap.
- * @uf: The userfaultfd list_head
+ * @vms: The vma munmap struct
* @mas_detach: The maple state tracking the detached tree
- * @locked_vm: a pointer to store the VM_LOCKED pages count.
*
* Return: 0 on success
*/
-static int
-vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
- struct mm_struct *mm, unsigned long start,
- unsigned long end, struct list_head *uf,
- struct ma_state *mas_detach, unsigned long *locked_vm)
+static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach)
{
struct vm_area_struct *next = NULL;
- int count = 0;
int error = -ENOMEM;
/*
@@ -771,23 +778,24 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
*/
/* Does it split the first one? */
- if (start > vma->vm_start) {
+ if (vms->start > vms->vma->vm_start) {
/*
* Make sure that map_count on return from munmap() will
* not exceed its limit; but let map_count go just above
* its limit temporarily, to help free resources as expected.
*/
- if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
+ if (vms->end < vms->vma->vm_end &&
+ vms->mm->map_count >= sysctl_max_map_count)
goto map_count_exceeded;
/* Don't bother splitting the VMA if we can't unmap it anyway */
- if (!can_modify_vma(vma)) {
+ if (!can_modify_vma(vms->vma)) {
error = -EPERM;
goto start_split_failed;
}
- error = __split_vma(vmi, vma, start, 1);
+ error = __split_vma(vms->vmi, vms->vma, vms->start, 1);
if (error)
goto start_split_failed;
}
@@ -796,7 +804,7 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- next = vma;
+ next = vms->vma;
do {
if (!can_modify_vma(next)) {
error = -EPERM;
@@ -804,22 +812,22 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
}
/* Does it split the end? */
- if (next->vm_end > end) {
- error = __split_vma(vmi, next, end, 0);
+ if (next->vm_end > vms->end) {
+ error = __split_vma(vms->vmi, next, vms->end, 0);
if (error)
goto end_split_failed;
}
vma_start_write(next);
- mas_set(mas_detach, count++);
+ mas_set(mas_detach, vms->vma_count++);
error = mas_store_gfp(mas_detach, next, GFP_KERNEL);
if (error)
goto munmap_gather_failed;
vma_mark_detached(next, true);
if (next->vm_flags & VM_LOCKED)
- *locked_vm += vma_pages(next);
+ vms->locked_vm += vma_pages(next);
- if (unlikely(uf)) {
+ if (unlikely(vms->uf)) {
/*
* If userfaultfd_unmap_prep returns an error the vmas
* will remain split, but userland will get a
@@ -829,16 +837,17 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
* split, despite we could. This is unlikely enough
* failure that it's not worth optimizing it for.
*/
- error = userfaultfd_unmap_prep(next, start, end, uf);
+ error = userfaultfd_unmap_prep(next, vms->start,
+ vms->end, vms->uf);
if (error)
goto userfaultfd_error;
}
#ifdef CONFIG_DEBUG_VM_MAPLE_TREE
- BUG_ON(next->vm_start < start);
- BUG_ON(next->vm_start > end);
+ BUG_ON(next->vm_start < vms->start);
+ BUG_ON(next->vm_start > vms->end);
#endif
- } for_each_vma_range(*vmi, next, end);
+ } for_each_vma_range(*(vms->vmi), next, vms->end);
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
@@ -847,27 +856,28 @@ vmi_gather_munmap_vmas(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct vm_area_struct *vma_mas, *vma_test;
int test_count = 0;
- vma_iter_set(vmi, start);
+ vma_iter_set(vms->vmi, vms->start);
rcu_read_lock();
- vma_test = mas_find(&test, count - 1);
- for_each_vma_range(*vmi, vma_mas, end) {
+ vma_test = mas_find(&test, vms->vma_count - 1);
+ for_each_vma_range(*(vms->vmi), vma_mas, vms->end) {
BUG_ON(vma_mas != vma_test);
test_count++;
- vma_test = mas_next(&test, count - 1);
+ vma_test = mas_next(&test, vms->vma_count - 1);
}
rcu_read_unlock();
- BUG_ON(count != test_count);
+ BUG_ON(vms->vma_count != test_count);
}
#endif
- while (vma_iter_addr(vmi) > start)
- vma_iter_prev_range(vmi);
+ while (vma_iter_addr(vms->vmi) > vms->start)
+ vma_iter_prev_range(vms->vmi);
return 0;
userfaultfd_error:
munmap_gather_failed:
end_split_failed:
+modify_vma_failed:
abort_munmap_vmas(mas_detach);
start_split_failed:
map_count_exceeded:
@@ -896,11 +906,11 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
MA_STATE(mas_detach, &mt_detach, 0, 0);
mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
mt_on_stack(mt_detach);
+ struct vma_munmap_struct vms;
int error;
- unsigned long locked_vm = 0;
- error = vmi_gather_munmap_vmas(vmi, vma, mm, start, end, uf,
- &mas_detach, &locked_vm);
+ init_vma_munmap(&vms, vmi, vma, start, end, uf, unlock);
+ error = vms_gather_munmap_vmas(&vms, &mas_detach);
if (error)
goto gather_failed;
@@ -909,11 +919,9 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto clear_tree_failed;
/* Point of no return */
- vmi_complete_munmap_vmas(vmi, vma, mm, start, end, unlock, &mas_detach,
- locked_vm);
+ vms_complete_munmap_vmas(&vms, &mas_detach);
return 0;
-modify_vma_failed:
clear_tree_failed:
abort_munmap_vmas(&mas_detach);
gather_failed:
diff --git a/mm/vma.h b/mm/vma.h
index da31d0f62157..cb67acf59012 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -26,6 +26,22 @@ struct unlink_vma_file_batch {
struct vm_area_struct *vmas[8];
};
+/*
+ * vma munmap operation
+ */
+struct vma_munmap_struct {
+ struct vma_iterator *vmi;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma; /* The first vma to munmap */
+ struct list_head *uf; /* Userfaultfd list_head */
+ unsigned long start; /* Aligned start addr (inclusive) */
+ unsigned long end; /* Aligned end addr (exclusive) */
+ int vma_count; /* Number of vmas that will be removed */
+ unsigned long nr_pages; /* Number of pages being removed */
+ unsigned long locked_vm; /* Number of locked pages */
+ bool unlock; /* Unlock after the munmap */
+};
+
#ifdef CONFIG_DEBUG_VM_MAPLE_TREE
void validate_mm(struct mm_struct *mm);
#else
--
2.43.0
* [PATCH v6 06/20] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (4 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 05/20] mm/vma: Introduce vma_munmap_struct for use in munmap operations Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-21 9:59 ` Lorenzo Stoakes
2024-08-20 23:57 ` [PATCH v6 07/20] mm/vma: Extract validate_mm() from vma_complete() Liam R. Howlett
` (13 subsequent siblings)
19 siblings, 1 reply; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett,
Liam R . Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Clean up the code by changing the munmap operation to use a structure
for the accounting and munmap variables.
Since remove_mt() is only called in one location and its contents will be
reduced to almost nothing, the remains of the function can be added to
vms_complete_munmap_vmas().
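For reference, the completion side after the change (condensed from the hunk
below); what used to live in remove_mt() is now the stat accounting and the
remove_vma() loop:

	/* Update high watermark before we lower total_vm */
	update_hiwater_vm(mm);
	WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
	mm->exec_vm -= vms->exec_vm;
	mm->stack_vm -= vms->stack_vm;
	mm->data_vm -= vms->data_vm;

	/* Remove and clean up vmas */
	mas_set(mas_detach, 0);
	mas_for_each(mas_detach, vma, ULONG_MAX)
		remove_vma(vma, false);

	vm_unacct_memory(vms->nr_accounted);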
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/vma.c | 79 ++++++++++++++++++++++++++++----------------------------
mm/vma.h | 6 +++++
2 files changed, 46 insertions(+), 39 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index e1aee43a3dc4..7b8b8b983399 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -103,7 +103,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->unlock = unlock;
vms->uf = uf;
vms->vma_count = 0;
- vms->nr_pages = vms->locked_vm = 0;
+ vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+ vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
}
/*
@@ -299,30 +300,6 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
return __split_vma(vmi, vma, addr, new_below);
}
-/*
- * Ok - we have the memory areas we should free on a maple tree so release them,
- * and do the vma updates.
- *
- * Called with the mm semaphore held.
- */
-static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
-{
- unsigned long nr_accounted = 0;
- struct vm_area_struct *vma;
-
- /* Update high watermark before we lower total_vm */
- update_hiwater_vm(mm);
- mas_for_each(mas, vma, ULONG_MAX) {
- long nrpages = vma_pages(vma);
-
- if (vma->vm_flags & VM_ACCOUNT)
- nr_accounted += nrpages;
- vm_stat_account(mm, vma->vm_flags, -nrpages);
- remove_vma(vma, false);
- }
- vm_unacct_memory(nr_accounted);
-}
-
/*
* init_vma_prep() - Initializer wrapper for vma_prepare struct
* @vp: The vma_prepare struct
@@ -722,7 +699,7 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach)
{
- struct vm_area_struct *prev, *next;
+ struct vm_area_struct *vma;
struct mm_struct *mm;
mm = vms->mm;
@@ -731,21 +708,26 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->unlock)
mmap_write_downgrade(mm);
- prev = vma_iter_prev_range(vms->vmi);
- next = vma_next(vms->vmi);
- if (next)
- vma_iter_prev_range(vms->vmi);
-
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
*/
mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
- vms->vma_count, !vms->unlock);
- /* Statistics and freeing VMAs */
+ unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
+ vms->start, vms->end, vms->vma_count, !vms->unlock);
+ /* Update high watermark before we lower total_vm */
+ update_hiwater_vm(mm);
+ /* Stat accounting */
+ WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
+ mm->exec_vm -= vms->exec_vm;
+ mm->stack_vm -= vms->stack_vm;
+ mm->data_vm -= vms->data_vm;
+ /* Remove and clean up vmas */
mas_set(mas_detach, 0);
- remove_mt(mm, mas_detach);
+ mas_for_each(mas_detach, vma, ULONG_MAX)
+ remove_vma(vma, false);
+
+ vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
if (vms->unlock)
mmap_read_unlock(mm);
@@ -799,18 +781,19 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
if (error)
goto start_split_failed;
}
+ vms->prev = vma_prev(vms->vmi);
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- next = vms->vma;
- do {
+ for_each_vma_range(*(vms->vmi), next, vms->end) {
+ long nrpages;
+
if (!can_modify_vma(next)) {
error = -EPERM;
goto modify_vma_failed;
}
-
/* Does it split the end? */
if (next->vm_end > vms->end) {
error = __split_vma(vms->vmi, next, vms->end, 0);
@@ -824,6 +807,22 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
goto munmap_gather_failed;
vma_mark_detached(next, true);
+ nrpages = vma_pages(next);
+
+ vms->nr_pages += nrpages;
+ if (next->vm_flags & VM_LOCKED)
+ vms->locked_vm += nrpages;
+
+ if (next->vm_flags & VM_ACCOUNT)
+ vms->nr_accounted += nrpages;
+
+ if (is_exec_mapping(next->vm_flags))
+ vms->exec_vm += nrpages;
+ else if (is_stack_mapping(next->vm_flags))
+ vms->stack_vm += nrpages;
+ else if (is_data_mapping(next->vm_flags))
+ vms->data_vm += nrpages;
+
if (next->vm_flags & VM_LOCKED)
vms->locked_vm += vma_pages(next);
@@ -847,7 +846,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
BUG_ON(next->vm_start < vms->start);
BUG_ON(next->vm_start > vms->end);
#endif
- } for_each_vma_range(*(vms->vmi), next, vms->end);
+ }
+
+ vms->next = vma_next(vms->vmi);
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
diff --git a/mm/vma.h b/mm/vma.h
index cb67acf59012..cbf55e0e0c4f 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -33,12 +33,18 @@ struct vma_munmap_struct {
struct vma_iterator *vmi;
struct mm_struct *mm;
struct vm_area_struct *vma; /* The first vma to munmap */
+ struct vm_area_struct *prev; /* vma before the munmap area */
+ struct vm_area_struct *next; /* vma after the munmap area */
struct list_head *uf; /* Userfaultfd list_head */
unsigned long start; /* Aligned start addr (inclusive) */
unsigned long end; /* Aligned end addr (exclusive) */
int vma_count; /* Number of vmas that will be removed */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
+ unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
+ unsigned long exec_vm;
+ unsigned long stack_vm;
+ unsigned long data_vm;
bool unlock; /* Unlock after the munmap */
};
--
2.43.0
* Re: [PATCH v6 06/20] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-08-20 23:57 ` [PATCH v6 06/20] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
@ 2024-08-21 9:59 ` Lorenzo Stoakes
2024-08-21 13:17 ` Liam R. Howlett
0 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2024-08-21 9:59 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
On Tue, Aug 20, 2024 at 07:57:15PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Clean up the code by changing the munmap operation to use a structure
> for the accounting and munmap variables.
>
> Since remove_mt() is only called in one location and the contents will
> be reduced to almost nothing. The remains of the function can be added
> to vms_complete_munmap_vmas().
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/vma.c | 79 ++++++++++++++++++++++++++++----------------------------
> mm/vma.h | 6 +++++
> 2 files changed, 46 insertions(+), 39 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index e1aee43a3dc4..7b8b8b983399 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
[snip]
> mm = vms->mm;
> @@ -731,21 +708,26 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> if (vms->unlock)
> mmap_write_downgrade(mm);
>
> - prev = vma_iter_prev_range(vms->vmi);
> - next = vma_next(vms->vmi);
> - if (next)
> - vma_iter_prev_range(vms->vmi);
> -
> /*
> * We can free page tables without write-locking mmap_lock because VMAs
> * were isolated before we downgraded mmap_lock.
> */
> mas_set(mas_detach, 1);
> - unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
> - vms->vma_count, !vms->unlock);
> - /* Statistics and freeing VMAs */
> + unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
> + vms->start, vms->end, vms->vma_count, !vms->unlock);
> + /* Update high watermark before we lower total_vm */
> + update_hiwater_vm(mm);
> + /* Stat accounting */
> + WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
> + mm->exec_vm -= vms->exec_vm;
> + mm->stack_vm -= vms->stack_vm;
> + mm->data_vm -= vms->data_vm;
See below, but I bisected a bug to this patch that manifested because of
miscalculated accounting. So I wonder whether it'd be a good idea to take
this opportunity, when updating mm->... stats to add some:
VM_WARN_ON(vms->exec_vm > mm->exec_vm);
etc. for each of the fields updated. This would help catch any accounting
issues like this with CONFIG_DEBUG_VM switched on.
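Something along these lines (sketch only, one check per updated field):

	VM_WARN_ON(vms->nr_pages > mm->total_vm);
	VM_WARN_ON(vms->exec_vm > mm->exec_vm);
	VM_WARN_ON(vms->stack_vm > mm->stack_vm);
	VM_WARN_ON(vms->data_vm > mm->data_vm);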
[snip]
> @@ -824,6 +807,22 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> goto munmap_gather_failed;
>
> vma_mark_detached(next, true);
> + nrpages = vma_pages(next);
> +
> + vms->nr_pages += nrpages;
> + if (next->vm_flags & VM_LOCKED)
> + vms->locked_vm += nrpages;
So I bisected a self-test failure, memfd_secret specifically, to this
commit. This is because you are double-counting VM_LOCKED...
> +
> + if (next->vm_flags & VM_ACCOUNT)
> + vms->nr_accounted += nrpages;
> +
> + if (is_exec_mapping(next->vm_flags))
> + vms->exec_vm += nrpages;
> + else if (is_stack_mapping(next->vm_flags))
> + vms->stack_vm += nrpages;
> + else if (is_data_mapping(next->vm_flags))
> + vms->data_vm += nrpages;
> +
> if (next->vm_flags & VM_LOCKED)
> vms->locked_vm += vma_pages(next);
...the double counting being right here :) so I think we should drop the
above couple lines.
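That is, keeping only the first accumulation (rough sketch):

		nrpages = vma_pages(next);
		vms->nr_pages += nrpages;
		if (next->vm_flags & VM_LOCKED)
			vms->locked_vm += nrpages;	/* counted once */
		if (next->vm_flags & VM_ACCOUNT)
			vms->nr_accounted += nrpages;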
>
> @@ -847,7 +846,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> BUG_ON(next->vm_start < vms->start);
> BUG_ON(next->vm_start > vms->end);
> #endif
> - } for_each_vma_range(*(vms->vmi), next, vms->end);
> + }
> +
> + vms->next = vma_next(vms->vmi);
>
> #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> /* Make sure no VMAs are about to be lost. */
> diff --git a/mm/vma.h b/mm/vma.h
> index cb67acf59012..cbf55e0e0c4f 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -33,12 +33,18 @@ struct vma_munmap_struct {
> struct vma_iterator *vmi;
> struct mm_struct *mm;
> struct vm_area_struct *vma; /* The first vma to munmap */
> + struct vm_area_struct *prev; /* vma before the munmap area */
> + struct vm_area_struct *next; /* vma after the munmap area */
> struct list_head *uf; /* Userfaultfd list_head */
> unsigned long start; /* Aligned start addr (inclusive) */
> unsigned long end; /* Aligned end addr (exclusive) */
> int vma_count; /* Number of vmas that will be removed */
> unsigned long nr_pages; /* Number of pages being removed */
> unsigned long locked_vm; /* Number of locked pages */
> + unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
> + unsigned long exec_vm;
> + unsigned long stack_vm;
> + unsigned long data_vm;
> bool unlock; /* Unlock after the munmap */
> };
>
> --
> 2.43.0
>
* Re: [PATCH v6 06/20] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas
2024-08-21 9:59 ` Lorenzo Stoakes
@ 2024-08-21 13:17 ` Liam R. Howlett
0 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-21 13:17 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240821 05:59]:
> On Tue, Aug 20, 2024 at 07:57:15PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Clean up the code by changing the munmap operation to use a structure
> > for the accounting and munmap variables.
> >
> > Since remove_mt() is only called in one location and the contents will
> > be reduced to almost nothing. The remains of the function can be added
> > to vms_complete_munmap_vmas().
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > mm/vma.c | 79 ++++++++++++++++++++++++++++----------------------------
> > mm/vma.h | 6 +++++
> > 2 files changed, 46 insertions(+), 39 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index e1aee43a3dc4..7b8b8b983399 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
>
> [snip]
>
> > mm = vms->mm;
> > @@ -731,21 +708,26 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > if (vms->unlock)
> > mmap_write_downgrade(mm);
> >
> > - prev = vma_iter_prev_range(vms->vmi);
> > - next = vma_next(vms->vmi);
> > - if (next)
> > - vma_iter_prev_range(vms->vmi);
> > -
> > /*
> > * We can free page tables without write-locking mmap_lock because VMAs
> > * were isolated before we downgraded mmap_lock.
> > */
> > mas_set(mas_detach, 1);
> > - unmap_region(mm, mas_detach, vms->vma, prev, next, vms->start, vms->end,
> > - vms->vma_count, !vms->unlock);
> > - /* Statistics and freeing VMAs */
> > + unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
> > + vms->start, vms->end, vms->vma_count, !vms->unlock);
> > + /* Update high watermark before we lower total_vm */
> > + update_hiwater_vm(mm);
> > + /* Stat accounting */
> > + WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) - vms->nr_pages);
> > + mm->exec_vm -= vms->exec_vm;
> > + mm->stack_vm -= vms->stack_vm;
> > + mm->data_vm -= vms->data_vm;
>
> See below, but I bisected a bug to this patch that manifested because of
> miscalculated accounting. So I wonder whether it'd be a good idea to take
> this opportunity, when updating mm->... stats to add some:
>
> VM_WARN_ON(vms->exec_vm > mm->exec_vm);
>
> etc. for each of the fields updated. This would help catch any accounting
> issues like this with CONFIG_DEBUG_VM switched on.
Sounds good.
>
> [snip]
>
> > @@ -824,6 +807,22 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > goto munmap_gather_failed;
> >
> > vma_mark_detached(next, true);
> > + nrpages = vma_pages(next);
> > +
> > + vms->nr_pages += nrpages;
> > + if (next->vm_flags & VM_LOCKED)
> > + vms->locked_vm += nrpages;
>
> So I bisected a self-test failure, memfd_secret specifically, to this
> commit. This is because you are double-counting VM_LOCKED...
>
> > +
> > + if (next->vm_flags & VM_ACCOUNT)
> > + vms->nr_accounted += nrpages;
> > +
> > + if (is_exec_mapping(next->vm_flags))
> > + vms->exec_vm += nrpages;
> > + else if (is_stack_mapping(next->vm_flags))
> > + vms->stack_vm += nrpages;
> > + else if (is_data_mapping(next->vm_flags))
> > + vms->data_vm += nrpages;
> > +
> > if (next->vm_flags & VM_LOCKED)
> > vms->locked_vm += vma_pages(next);
>
> ...the double counting being right here :) so I think we should drop the
> above couple lines.
Yeah, sure.
Thanks for reviewing!
>
> >
> > @@ -847,7 +846,9 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > BUG_ON(next->vm_start < vms->start);
> > BUG_ON(next->vm_start > vms->end);
> > #endif
> > - } for_each_vma_range(*(vms->vmi), next, vms->end);
> > + }
> > +
> > + vms->next = vma_next(vms->vmi);
> >
> > #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > /* Make sure no VMAs are about to be lost. */
> > diff --git a/mm/vma.h b/mm/vma.h
> > index cb67acf59012..cbf55e0e0c4f 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -33,12 +33,18 @@ struct vma_munmap_struct {
> > struct vma_iterator *vmi;
> > struct mm_struct *mm;
> > struct vm_area_struct *vma; /* The first vma to munmap */
> > + struct vm_area_struct *prev; /* vma before the munmap area */
> > + struct vm_area_struct *next; /* vma after the munmap area */
> > struct list_head *uf; /* Userfaultfd list_head */
> > unsigned long start; /* Aligned start addr (inclusive) */
> > unsigned long end; /* Aligned end addr (exclusive) */
> > int vma_count; /* Number of vmas that will be removed */
> > unsigned long nr_pages; /* Number of pages being removed */
> > unsigned long locked_vm; /* Number of locked pages */
> > + unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
> > + unsigned long exec_vm;
> > + unsigned long stack_vm;
> > + unsigned long data_vm;
> > bool unlock; /* Unlock after the munmap */
> > };
> >
> > --
> > 2.43.0
> >
* [PATCH v6 07/20] mm/vma: Extract validate_mm() from vma_complete()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (5 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 06/20] mm/vma: Change munmap to use vma_munmap_struct() for accounting and surrounding vmas Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 08/20] mm/vma: Inline munmap operation in mmap_region() Liam R. Howlett
` (12 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett,
Liam R . Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
vma_complete() will need to be called at a point where it is unsafe to call
validate_mm(). Extract the validate_mm() call to every call site now so that
only one location needs to be modified in the next change.
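For illustration, every affected call site now follows the same pattern (see
the hunks below):

	vma_complete(&vp, vmi, vma->vm_mm);
	validate_mm(vma->vm_mm);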
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 1 +
mm/vma.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 30ae4cb5cec9..112f2111c457 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1796,6 +1796,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, mm);
+ validate_mm(mm);
khugepaged_enter_vma(vma, flags);
goto out;
}
diff --git a/mm/vma.c b/mm/vma.c
index 7b8b8b983399..5abda4c49c83 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -269,6 +269,7 @@ static int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* vma_complete stores the new vma */
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
/* Success. */
if (new_below)
@@ -548,6 +549,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
return 0;
nomem:
@@ -589,6 +591,7 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_clear(vmi);
vma_set_range(vma, start, end, pgoff);
vma_complete(&vp, vmi, vma->vm_mm);
+ validate_mm(vma->vm_mm);
return 0;
}
@@ -668,7 +671,6 @@ void vma_complete(struct vma_prepare *vp,
}
if (vp->insert && vp->file)
uprobe_mmap(vp->insert);
- validate_mm(mm);
}
/*
@@ -1200,6 +1202,7 @@ static struct vm_area_struct
}
vma_complete(&vp, vmi, mm);
+ validate_mm(mm);
khugepaged_enter_vma(res, vm_flags);
return res;
--
2.43.0
* [PATCH v6 08/20] mm/vma: Inline munmap operation in mmap_region()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (6 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 07/20] mm/vma: Extract validate_mm() from vma_complete() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 09/20] mm/vma: Expand mmap_region() munmap call Liam R. Howlett
` (11 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
mmap_region() is already passed a sanitized addr and len, so change the
do_vmi_munmap() call to do_vmi_align_munmap() and inline the other
checks.
The inlining of the function and checks is an intermediate step in the
series so future patches are easier to follow.
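For reference, the before/after shape of the call site (condensed from the
hunk below):

	/* Before: the helper performed its own checks on addr/len. */
	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);

	/* After: only the overlap lookup remains inline here. */
	vma = vma_find(&vmi, end);
	if (vma) {
		if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
			return -ENOMEM;
		vma = NULL;
	}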
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 112f2111c457..0f5be29d48b6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1388,12 +1388,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return -ENOMEM;
}
- /* Unmap any existing mapping in the area */
- error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
- if (error == -EPERM)
- return error;
- else if (error)
- return -ENOMEM;
+ /* Find the first overlapping VMA */
+ vma = vma_find(&vmi, end);
+ if (vma) {
+ /* Unmap any existing mapping in the area */
+ if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+ return -ENOMEM;
+ vma = NULL;
+ }
/*
* Private writable mapping: check memory availability
--
2.43.0
* [PATCH v6 09/20] mm/vma: Expand mmap_region() munmap call
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (7 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 08/20] mm/vma: Inline munmap operation in mmap_region() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 10/20] mm/vma: Support vma == NULL in init_vma_munmap() Liam R. Howlett
` (10 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Open code the do_vmi_align_munmap() call so that it can be broken up
later in the series.
This requires exposing a few more vma operations.
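Condensed, the sequence mmap_region() can now drive directly, using the
operations exposed in mm/vma.h by this patch:

	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
	if (vms_gather_munmap_vmas(&vms, &mas_detach))
		return -ENOMEM;
	if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
		return -ENOMEM;
	vms_complete_munmap_vmas(&vms, &mas_detach);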
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 26 ++++++++++++++++++++++----
mm/vma.c | 31 ++-----------------------------
mm/vma.h | 33 +++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+), 33 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 0f5be29d48b6..e7e6bf09b558 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1366,6 +1366,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct vm_area_struct *next, *prev, *merge;
pgoff_t pglen = len >> PAGE_SHIFT;
unsigned long charged = 0;
+ struct vma_munmap_struct vms;
+ struct ma_state mas_detach;
+ struct maple_tree mt_detach;
unsigned long end = addr + len;
unsigned long merge_start = addr, merge_end = end;
bool writable_file_mapping = false;
@@ -1391,10 +1394,27 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
if (vma) {
- /* Unmap any existing mapping in the area */
- if (do_vmi_align_munmap(&vmi, vma, mm, addr, end, uf, false))
+ mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+ mt_on_stack(mt_detach);
+ mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+ init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
+ /* Prepare to unmap any existing mapping in the area */
+ if (vms_gather_munmap_vmas(&vms, &mas_detach))
+ return -ENOMEM;
+
+ /* Remove any existing mappings from the vma tree */
+ if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
return -ENOMEM;
+
+ /* Unmap any existing mapping in the area */
+ vms_complete_munmap_vmas(&vms, &mas_detach);
+ next = vms.next;
+ prev = vms.prev;
+ vma_prev(&vmi);
vma = NULL;
+ } else {
+ next = vma_next(&vmi);
+ prev = vma_prev(&vmi);
}
/*
@@ -1407,8 +1427,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}
- next = vma_next(&vmi);
- prev = vma_prev(&vmi);
if (vm_flags & VM_SPECIAL) {
if (prev)
vma_iter_next_range(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index 5abda4c49c83..2840cbaeff8b 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -80,33 +80,6 @@ static void init_multi_vma_prep(struct vma_prepare *vp,
}
-/*
- * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
- * @vms: The vma munmap struct
- * @vmi: The vma iterator
- * @vma: The first vm_area_struct to munmap
- * @start: The aligned start address to munmap
- * @end: The aligned end address to munmap
- * @uf: The userfaultfd list_head
- * @unlock: Unlock after the operation. Only unlocked on success
- */
-static inline void init_vma_munmap(struct vma_munmap_struct *vms,
- struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long start, unsigned long end, struct list_head *uf,
- bool unlock)
-{
- vms->vmi = vmi;
- vms->vma = vma;
- vms->mm = vma->vm_mm;
- vms->start = start;
- vms->end = end;
- vms->unlock = unlock;
- vms->uf = uf;
- vms->vma_count = 0;
- vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
- vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
-}
-
/*
* Return true if we can merge this (vm_flags,anon_vma,file,vm_pgoff)
* in front of (at a lower virtual address and file offset than) the vma.
@@ -698,7 +671,7 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
* used for the munmap() and may downgrade the lock - if requested. Everything
* needed to be done once the vma maple tree is updated.
*/
-static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach)
{
struct vm_area_struct *vma;
@@ -747,7 +720,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
*
* Return: 0 on success
*/
-static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach)
{
struct vm_area_struct *next = NULL;
diff --git a/mm/vma.h b/mm/vma.h
index cbf55e0e0c4f..e78b24d1cf83 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -78,6 +78,39 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long start, unsigned long end, pgoff_t pgoff);
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation. Only unlocked on success
+ */
+static inline void init_vma_munmap(struct vma_munmap_struct *vms,
+ struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, struct list_head *uf,
+ bool unlock)
+{
+ vms->vmi = vmi;
+ vms->vma = vma;
+ vms->mm = vma->vm_mm;
+ vms->start = start;
+ vms->end = end;
+ vms->unlock = unlock;
+ vms->uf = uf;
+ vms->vma_count = 0;
+ vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+ vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+}
+
+int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach);
+
+void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach);
+
int
do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
--
2.43.0
* [PATCH v6 10/20] mm/vma: Support vma == NULL in init_vma_munmap()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (8 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 09/20] mm/vma: Expand mmap_region() munmap call Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-21 10:07 ` Lorenzo Stoakes
2024-08-20 23:57 ` [PATCH v6 11/20] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
` (9 subsequent siblings)
19 siblings, 1 reply; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Adding support for a NULL vma means init_vma_munmap() can always be called,
which makes the process less error-prone when vms_complete_munmap_vmas() is
called later on.
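For illustration, the call site after this change; the struct is set up
whether or not an overlapping vma exists:

	vma = vma_find(&vmi, end);
	init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
	if (vma) {
		mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
		mt_on_stack(mt_detach);
		mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
		/* gather, clear and complete the munmap as before */
	}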
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 2 +-
mm/vma.h | 11 ++++++++---
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index e7e6bf09b558..2b7445a002dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1393,11 +1393,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
+ init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
if (vma) {
mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
mt_on_stack(mt_detach);
mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
- init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
/* Prepare to unmap any existing mapping in the area */
if (vms_gather_munmap_vmas(&vms, &mas_detach))
return -ENOMEM;
diff --git a/mm/vma.h b/mm/vma.h
index e78b24d1cf83..0e214bbf443e 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -95,9 +95,14 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
{
vms->vmi = vmi;
vms->vma = vma;
- vms->mm = vma->vm_mm;
- vms->start = start;
- vms->end = end;
+ if (vma) {
+ vms->mm = vma->vm_mm;
+ vms->start = start;
+ vms->end = end;
+ } else {
+ vms->mm = NULL;
+ vms->start = vms->end = 0;
+ }
vms->unlock = unlock;
vms->uf = uf;
vms->vma_count = 0;
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v6 10/20] mm/vma: Support vma == NULL in init_vma_munmap()
2024-08-20 23:57 ` [PATCH v6 10/20] mm/vma: Support vma == NULL in init_vma_munmap() Liam R. Howlett
@ 2024-08-21 10:07 ` Lorenzo Stoakes
2024-08-21 13:18 ` Liam R. Howlett
0 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2024-08-21 10:07 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
On Tue, Aug 20, 2024 at 07:57:19PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Adding support for a NULL vma means the init_vma_munmap() can be
> initialized for a less error-prone process when calling
> vms_complete_munmap_vmas() later on.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Thought I already R-b'd this :) Anyway,
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/mmap.c | 2 +-
> mm/vma.h | 11 ++++++++---
> 2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index e7e6bf09b558..2b7445a002dc 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1393,11 +1393,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
> /* Find the first overlapping VMA */
> vma = vma_find(&vmi, end);
> + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> if (vma) {
> mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> mt_on_stack(mt_detach);
> mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> - init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> /* Prepare to unmap any existing mapping in the area */
> if (vms_gather_munmap_vmas(&vms, &mas_detach))
> return -ENOMEM;
> diff --git a/mm/vma.h b/mm/vma.h
> index e78b24d1cf83..0e214bbf443e 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -95,9 +95,14 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> {
> vms->vmi = vmi;
> vms->vma = vma;
> - vms->mm = vma->vm_mm;
> - vms->start = start;
> - vms->end = end;
> + if (vma) {
> + vms->mm = vma->vm_mm;
> + vms->start = start;
> + vms->end = end;
> + } else {
> + vms->mm = NULL;
> + vms->start = vms->end = 0;
> + }
> vms->unlock = unlock;
> vms->uf = uf;
> vms->vma_count = 0;
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v6 10/20] mm/vma: Support vma == NULL in init_vma_munmap()
2024-08-21 10:07 ` Lorenzo Stoakes
@ 2024-08-21 13:18 ` Liam R. Howlett
0 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-21 13:18 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240821 06:07]:
> On Tue, Aug 20, 2024 at 07:57:19PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Adding support for a NULL vma means the init_vma_munmap() can be
> > initialized for a less error-prone process when calling
> > vms_complete_munmap_vmas() later on.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
>
> Thought I already R-b'd this :) Anyway,
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
I probably lost it in my porting of patches somehow. Thanks
>
> > ---
> > mm/mmap.c | 2 +-
> > mm/vma.h | 11 ++++++++---
> > 2 files changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index e7e6bf09b558..2b7445a002dc 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1393,11 +1393,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >
> > /* Find the first overlapping VMA */
> > vma = vma_find(&vmi, end);
> > + init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > if (vma) {
> > mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > mt_on_stack(mt_detach);
> > mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > - init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > /* Prepare to unmap any existing mapping in the area */
> > if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > return -ENOMEM;
> > diff --git a/mm/vma.h b/mm/vma.h
> > index e78b24d1cf83..0e214bbf443e 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -95,9 +95,14 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > {
> > vms->vmi = vmi;
> > vms->vma = vma;
> > - vms->mm = vma->vm_mm;
> > - vms->start = start;
> > - vms->end = end;
> > + if (vma) {
> > + vms->mm = vma->vm_mm;
> > + vms->start = start;
> > + vms->end = end;
> > + } else {
> > + vms->mm = NULL;
> > + vms->start = vms->end = 0;
> > + }
> > vms->unlock = unlock;
> > vms->uf = uf;
> > vms->vma_count = 0;
> > --
> > 2.43.0
> >
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v6 11/20] mm/mmap: Reposition vma iterator in mmap_region()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (9 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 10/20] mm/vma: Support vma == NULL in init_vma_munmap() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 12/20] mm/vma: Track start and end for munmap in vma_munmap_struct Liam R. Howlett
` (8 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of moving (or leaving) the vma iterator pointing at the previous
vma, leave it pointing at the insert location. Pointing the vma
iterator at the insert location allows for a cleaner walk of the vma
tree for MAP_FIXED and the no expansion cases.
The vma_prev() call in the case of merging the previous vma is
equivalent to vma_iter_prev_range(), since the vma iterator will be
pointing to the location just before the previous vma.
This change needs to export abort_munmap_vmas() from mm/vma.
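For the no-unmap case, the repositioning amounts to the following (a
simplified sketch of the mmap_region() hunk below):

        next = vma_next(&vmi);
        prev = vma_prev(&vmi);
        if (prev)
                /* Step forward so the iterator ends up on the insert
                 * location rather than on prev. */
                vma_iter_next_range(&vmi);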
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 38 ++++++++++++++++++++++----------------
mm/vma.c | 16 ----------------
mm/vma.h | 16 ++++++++++++++++
3 files changed, 38 insertions(+), 32 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 2b7445a002dc..9285bdf14c4f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1400,21 +1400,22 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
/* Prepare to unmap any existing mapping in the area */
if (vms_gather_munmap_vmas(&vms, &mas_detach))
- return -ENOMEM;
+ goto gather_failed;
/* Remove any existing mappings from the vma tree */
if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
- return -ENOMEM;
+ goto clear_tree_failed;
/* Unmap any existing mapping in the area */
vms_complete_munmap_vmas(&vms, &mas_detach);
next = vms.next;
prev = vms.prev;
- vma_prev(&vmi);
vma = NULL;
} else {
next = vma_next(&vmi);
prev = vma_prev(&vmi);
+ if (prev)
+ vma_iter_next_range(&vmi);
}
/*
@@ -1427,11 +1428,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags |= VM_ACCOUNT;
}
- if (vm_flags & VM_SPECIAL) {
- if (prev)
- vma_iter_next_range(&vmi);
+ if (vm_flags & VM_SPECIAL)
goto cannot_expand;
- }
/* Attempt to expand an old mapping */
/* Check next */
@@ -1452,19 +1450,21 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
merge_start = prev->vm_start;
vma = prev;
vm_pgoff = prev->vm_pgoff;
- } else if (prev) {
- vma_iter_next_range(&vmi);
+ vma_prev(&vmi); /* Equivalent to going to the previous range */
}
- /* Actually expand, if possible */
- if (vma &&
- !vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
- khugepaged_enter_vma(vma, vm_flags);
- goto expanded;
+ if (vma) {
+ /* Actually expand, if possible */
+ if (!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
+ khugepaged_enter_vma(vma, vm_flags);
+ goto expanded;
+ }
+
+ /* If the expand fails, then reposition the vma iterator */
+ if (unlikely(vma == prev))
+ vma_iter_set(&vmi, addr);
}
- if (vma == prev)
- vma_iter_set(&vmi, addr);
cannot_expand:
/*
@@ -1625,6 +1625,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_unacct_memory(charged);
validate_mm(mm);
return error;
+
+clear_tree_failed:
+ abort_munmap_vmas(&mas_detach);
+gather_failed:
+ validate_mm(mm);
+ return -ENOMEM;
}
static int __vm_munmap(unsigned long start, size_t len, bool unlock)
diff --git a/mm/vma.c b/mm/vma.c
index 2840cbaeff8b..a8d716c39db2 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -646,22 +646,6 @@ void vma_complete(struct vma_prepare *vp,
uprobe_mmap(vp->insert);
}
-/*
- * abort_munmap_vmas - Undo any munmap work and free resources
- *
- * Reattach any detached vmas and free up the maple tree used to track the vmas.
- */
-static inline void abort_munmap_vmas(struct ma_state *mas_detach)
-{
- struct vm_area_struct *vma;
-
- mas_set(mas_detach, 0);
- mas_for_each(mas_detach, vma, ULONG_MAX)
- vma_mark_detached(vma, false);
-
- __mt_destroy(mas_detach->tree);
-}
-
/*
* vms_complete_munmap_vmas() - Finish the munmap() operation
* @vms: The vma munmap struct
diff --git a/mm/vma.h b/mm/vma.h
index 0e214bbf443e..c85fc7c888a8 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -116,6 +116,22 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach);
+/*
+ * abort_munmap_vmas - Undo any munmap work and free resources
+ *
+ * Reattach any detached vmas and free up the maple tree used to track the vmas.
+ */
+static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+{
+ struct vm_area_struct *vma;
+
+ mas_set(mas_detach, 0);
+ mas_for_each(mas_detach, vma, ULONG_MAX)
+ vma_mark_detached(vma, false);
+
+ __mt_destroy(mas_detach->tree);
+}
+
int
do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v6 12/20] mm/vma: Track start and end for munmap in vma_munmap_struct
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (10 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 11/20] mm/mmap: Reposition vma iterator in mmap_region() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 13/20] mm: Clean up unmap_region() argument list Liam R. Howlett
` (7 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Set the start and end address for munmap when the prev and next are
gathered. This is needed to avoid incorrect addresses being used in
vms_complete_munmap_vmas() if the prev/next vma is expanded.
Add a new helper vms_complete_pte_clear(), which is needed later and
will avoid growing the argument list to unmap_region() beyond the 9 it
already has.
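Put together (a sketch assembled from the hunks below, using the new
vma_munmap_struct fields), the range used for clearing the page tables
ends up as:

        /* Defaults, set in init_vma_munmap() */
        vms->unmap_start = FIRST_USER_ADDRESS;
        vms->unmap_end = USER_PGTABLES_CEILING;

        /* Tightened while gathering, once prev/next are known */
        if (vms->prev)
                vms->unmap_start = vms->prev->vm_end;
        if (vms->next)
                vms->unmap_end = vms->next->vm_start;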
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma.c | 32 +++++++++++++++++++++++++-------
mm/vma.h | 4 ++++
2 files changed, 29 insertions(+), 7 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index a8d716c39db2..5ba01dde52e7 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -646,6 +646,26 @@ void vma_complete(struct vma_prepare *vp,
uprobe_mmap(vp->insert);
}
+static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach, bool mm_wr_locked)
+{
+ struct mmu_gather tlb;
+
+ /*
+ * We can free page tables without write-locking mmap_lock because VMAs
+ * were isolated before we downgraded mmap_lock.
+ */
+ mas_set(mas_detach, 1);
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, vms->mm);
+ update_hiwater_rss(vms->mm);
+ unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
+ mas_set(mas_detach, 1);
+ /* start and end may be different if there is no prev or next vma. */
+ free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
+ tlb_finish_mmu(&tlb);
+}
+
/*
* vms_complete_munmap_vmas() - Finish the munmap() operation
* @vms: The vma munmap struct
@@ -667,13 +687,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->unlock)
mmap_write_downgrade(mm);
- /*
- * We can free page tables without write-locking mmap_lock because VMAs
- * were isolated before we downgraded mmap_lock.
- */
- mas_set(mas_detach, 1);
- unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
- vms->start, vms->end, vms->vma_count, !vms->unlock);
+ vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
/* Update high watermark before we lower total_vm */
update_hiwater_vm(mm);
/* Stat accounting */
@@ -741,6 +755,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
goto start_split_failed;
}
vms->prev = vma_prev(vms->vmi);
+ if (vms->prev)
+ vms->unmap_start = vms->prev->vm_end;
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
@@ -808,6 +824,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
}
vms->next = vma_next(vms->vmi);
+ if (vms->next)
+ vms->unmap_end = vms->next->vm_start;
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
diff --git a/mm/vma.h b/mm/vma.h
index c85fc7c888a8..7bc0f9e7751b 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -38,6 +38,8 @@ struct vma_munmap_struct {
struct list_head *uf; /* Userfaultfd list_head */
unsigned long start; /* Aligned start addr (inclusive) */
unsigned long end; /* Aligned end addr (exclusive) */
+ unsigned long unmap_start; /* Unmap PTE start */
+ unsigned long unmap_end; /* Unmap PTE end */
int vma_count; /* Number of vmas that will be removed */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
@@ -108,6 +110,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->vma_count = 0;
vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+ vms->unmap_start = FIRST_USER_ADDRESS;
+ vms->unmap_end = USER_PGTABLES_CEILING;
}
int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v6 13/20] mm: Clean up unmap_region() argument list
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (11 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 12/20] mm/vma: Track start and end for munmap in vma_munmap_struct Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 14/20] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
` (6 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
With the only caller to unmap_region() being the error path of
mmap_region(), the argument list can be significantly reduced.
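The single remaining call site then looks like this (as in the
mmap_region() hunk below):

        vma_iter_set(&vmi, vma->vm_end);
        /* Undo any partial mapping done by a device driver. */
        unmap_region(&vmi.mas, vma, prev, next);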
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 3 +--
mm/vma.c | 17 ++++++++---------
mm/vma.h | 6 ++----
3 files changed, 11 insertions(+), 15 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 9285bdf14c4f..71b2bad717b6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1613,8 +1613,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_iter_set(&vmi, vma->vm_end);
/* Undo any partial mapping done by a device driver. */
- unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
- vma->vm_end, vma->vm_end, true);
+ unmap_region(&vmi.mas, vma, prev, next);
}
if (writable_file_mapping)
mapping_unmap_writable(file->f_mapping);
diff --git a/mm/vma.c b/mm/vma.c
index 5ba01dde52e7..7104c2c080bb 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -155,22 +155,21 @@ void remove_vma(struct vm_area_struct *vma, bool unreachable)
*
* Called with the mm semaphore held.
*/
-void unmap_region(struct mm_struct *mm, struct ma_state *mas,
- struct vm_area_struct *vma, struct vm_area_struct *prev,
- struct vm_area_struct *next, unsigned long start,
- unsigned long end, unsigned long tree_end, bool mm_wr_locked)
+void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+ struct vm_area_struct *prev, struct vm_area_struct *next)
{
+ struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
- unsigned long mt_start = mas->index;
lru_add_drain();
tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
- mas_set(mas, mt_start);
+ unmap_vmas(&tlb, mas, vma, vma->vm_start, vma->vm_end, vma->vm_end,
+ /* mm_wr_locked = */ true);
+ mas_set(mas, vma->vm_end);
free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
- next ? next->vm_start : USER_PGTABLES_CEILING,
- mm_wr_locked);
+ next ? next->vm_start : USER_PGTABLES_CEILING,
+ /* mm_wr_locked = */ true);
tlb_finish_mmu(&tlb);
}
diff --git a/mm/vma.h b/mm/vma.h
index 7bc0f9e7751b..6028fdf79257 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -147,10 +147,8 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
void remove_vma(struct vm_area_struct *vma, bool unreachable);
-void unmap_region(struct mm_struct *mm, struct ma_state *mas,
- struct vm_area_struct *vma, struct vm_area_struct *prev,
- struct vm_area_struct *next, unsigned long start,
- unsigned long end, unsigned long tree_end, bool mm_wr_locked);
+void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
+ struct vm_area_struct *prev, struct vm_area_struct *next);
/* Required by mmap_region(). */
bool
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v6 14/20] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (12 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 13/20] mm: Clean up unmap_region() argument list Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-21 11:02 ` Lorenzo Stoakes
2024-08-20 23:57 ` [PATCH v6 15/20] mm: Change failure of MAP_FIXED to restoring the gap on failure Liam R. Howlett
` (5 subsequent siblings)
19 siblings, 1 reply; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of zeroing the vma tree and then overwriting the area, let the
area be overwritten and then clean up the gathered vmas using
vms_complete_munmap_vmas().
To ensure locking is downgraded correctly, the mm is set regardless of
whether MAP_FIXED is in use, i.e. even when no vma overlaps the range
(NULL vma).
If a driver is mapping over an existing vma, then clear the ptes before
the call_mmap() invocation. This is done using the vms_clean_up_area()
helper. If there is a close vm_ops, that must also be called to ensure
any cleanup is done before mapping over the area. This also means that
calling open has been added to the abort of an unmap operation, for now.
Temporarily keep track of the number of pages that will be removed and
reduce the charged amount.
This also drops the validate_mm() call in the vma_expand() function.
It is necessary to drop the validation as it would otherwise fail: the
mm map_count is incorrect during a vma expansion, prior to the cleanup
done by vms_complete_munmap_vmas().
Clean up the error handling of vms_gather_munmap_vmas() by calling
the verification within the function.
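The resulting ordering in mmap_region() is roughly the following (a
heavily simplified sketch of the hunks below, with most error handling
omitted):

        vma = vma_find(&vmi, end);
        init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
        if (vma) {
                /* Gather only; the tree is not zeroed, nothing is freed yet */
                if (vms_gather_munmap_vmas(&vms, &mas_detach))
                        return -ENOMEM;
        }

        if (accountable_mapping(file, vm_flags)) {
                charged = len >> PAGE_SHIFT;
                charged -= nr_accounted;   /* already charged by the old vmas */
                if (security_vm_enough_memory_mm(mm, charged))
                        goto abort_munmap; /* reattaches the gathered vmas */
                vms.nr_accounted = 0;
                vm_flags |= VM_ACCOUNT;
        }

        if (file) {
                /*
                 * call_mmap() may map PTEs, so clear the old ones (and call
                 * any vm_ops->close()) before handing the range to the driver.
                 */
                vms_clean_up_area(&vms, &mas_detach, true);
                error = call_mmap(file, vma);
        }

        /* Only on success are the old vmas actually torn down */
        vms_complete_munmap_vmas(&vms, &mas_detach);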
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 62 ++++++++++++++++++++++++++-----------------------------
mm/vma.c | 54 +++++++++++++++++++++++++++++++++++++-----------
mm/vma.h | 22 ++++++++++++++------
3 files changed, 87 insertions(+), 51 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 71b2bad717b6..6550d9470d3a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1373,23 +1373,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long merge_start = addr, merge_end = end;
bool writable_file_mapping = false;
pgoff_t vm_pgoff;
- int error;
+ int error = -ENOMEM;
VMA_ITERATOR(vmi, mm, addr);
+ unsigned long nr_pages, nr_accounted;
- /* Check against address space limit. */
- if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
- unsigned long nr_pages;
+ nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
- /*
- * MAP_FIXED may remove pages of mappings that intersects with
- * requested mapping. Account for the pages it would unmap.
- */
- nr_pages = count_vma_pages_range(mm, addr, end);
-
- if (!may_expand_vm(mm, vm_flags,
- (len >> PAGE_SHIFT) - nr_pages))
- return -ENOMEM;
- }
+ /*
+ * Check against address space limit.
+ * MAP_FIXED may remove pages of mappings that intersects with requested
+ * mapping. Account for the pages it would unmap.
+ */
+ if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+ return -ENOMEM;
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
@@ -1400,14 +1396,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
/* Prepare to unmap any existing mapping in the area */
if (vms_gather_munmap_vmas(&vms, &mas_detach))
- goto gather_failed;
-
- /* Remove any existing mappings from the vma tree */
- if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
- goto clear_tree_failed;
+ return -ENOMEM;
- /* Unmap any existing mapping in the area */
- vms_complete_munmap_vmas(&vms, &mas_detach);
next = vms.next;
prev = vms.prev;
vma = NULL;
@@ -1423,8 +1413,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*/
if (accountable_mapping(file, vm_flags)) {
charged = len >> PAGE_SHIFT;
+ charged -= nr_accounted;
if (security_vm_enough_memory_mm(mm, charged))
- return -ENOMEM;
+ goto abort_munmap;
+ vms.nr_accounted = 0;
vm_flags |= VM_ACCOUNT;
}
@@ -1473,10 +1465,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* not unmapped, but the maps are removed from the list.
*/
vma = vm_area_alloc(mm);
- if (!vma) {
- error = -ENOMEM;
+ if (!vma)
goto unacct_error;
- }
vma_iter_config(&vmi, addr, end);
vma_set_range(vma, addr, end, pgoff);
@@ -1485,6 +1475,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (file) {
vma->vm_file = get_file(file);
+ /*
+ * call_mmap() may map PTE, so ensure there are no existing PTEs
+ * call the vm_ops close function if one exists.
+ */
+ vms_clean_up_area(&vms, &mas_detach, true);
error = call_mmap(file, vma);
if (error)
goto unmap_and_free_vma;
@@ -1575,6 +1570,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
expanded:
perf_event_mmap(vma);
+ /* Unmap any existing mapping in the area */
+ vms_complete_munmap_vmas(&vms, &mas_detach);
+
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
@@ -1603,7 +1601,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return addr;
close_and_free_vma:
- if (file && vma->vm_ops && vma->vm_ops->close)
+ if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
if (file || vma->vm_file) {
@@ -1622,14 +1620,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
unacct_error:
if (charged)
vm_unacct_memory(charged);
- validate_mm(mm);
- return error;
-clear_tree_failed:
- abort_munmap_vmas(&mas_detach);
-gather_failed:
+abort_munmap:
+ if (vms.nr_pages)
+ abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
validate_mm(mm);
- return -ENOMEM;
+ return error;
}
static int __vm_munmap(unsigned long start, size_t len, bool unlock)
@@ -1959,7 +1955,7 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- remove_vma(vma, true);
+ remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
count++;
cond_resched();
vma = vma_next(&vmi);
diff --git a/mm/vma.c b/mm/vma.c
index 7104c2c080bb..5b33f7460ab7 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -136,10 +136,10 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
/*
* Close a vm structure and free it.
*/
-void remove_vma(struct vm_area_struct *vma, bool unreachable)
+void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
{
might_sleep();
- if (vma->vm_ops && vma->vm_ops->close)
+ if (!closed && vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
if (vma->vm_file)
fput(vma->vm_file);
@@ -521,7 +521,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
vma_iter_store(vmi, vma);
vma_complete(&vp, vmi, vma->vm_mm);
- validate_mm(vma->vm_mm);
return 0;
nomem:
@@ -645,11 +644,14 @@ void vma_complete(struct vma_prepare *vp,
uprobe_mmap(vp->insert);
}
-static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
- struct ma_state *mas_detach, bool mm_wr_locked)
+static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach, bool mm_wr_locked)
{
struct mmu_gather tlb;
+ if (!vms->clear_ptes) /* Nothing to do */
+ return;
+
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
@@ -658,11 +660,31 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
lru_add_drain();
tlb_gather_mmu(&tlb, vms->mm);
update_hiwater_rss(vms->mm);
- unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
+ unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end,
+ vms->vma_count, mm_wr_locked);
+
mas_set(mas_detach, 1);
/* start and end may be different if there is no prev or next vma. */
- free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
+ free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
+ vms->unmap_end, mm_wr_locked);
tlb_finish_mmu(&tlb);
+ vms->clear_ptes = false;
+}
+
+void vms_clean_up_area(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach, bool mm_wr_locked)
+{
+ struct vm_area_struct *vma;
+
+ if (!vms->nr_pages)
+ return;
+
+ vms_clear_ptes(vms, mas_detach, mm_wr_locked);
+ mas_set(mas_detach, 0);
+ mas_for_each(mas_detach, vma, ULONG_MAX)
+ if (vma->vm_ops && vma->vm_ops->close)
+ vma->vm_ops->close(vma);
+ vms->closed_vm_ops = true;
}
/*
@@ -686,7 +708,10 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
if (vms->unlock)
mmap_write_downgrade(mm);
- vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
+ if (!vms->nr_pages)
+ return;
+
+ vms_clear_ptes(vms, mas_detach, !vms->unlock);
/* Update high watermark before we lower total_vm */
update_hiwater_vm(mm);
/* Stat accounting */
@@ -697,7 +722,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
/* Remove and clean up vmas */
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- remove_vma(vma, false);
+ remove_vma(vma, /* = */ false, vms->closed_vm_ops);
vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
@@ -849,13 +874,14 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
while (vma_iter_addr(vms->vmi) > vms->start)
vma_iter_prev_range(vms->vmi);
+ vms->clear_ptes = true;
return 0;
userfaultfd_error:
munmap_gather_failed:
end_split_failed:
modify_vma_failed:
- abort_munmap_vmas(mas_detach);
+ abort_munmap_vmas(mas_detach, /* closed = */ false);
start_split_failed:
map_count_exceeded:
return error;
@@ -900,7 +926,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
return 0;
clear_tree_failed:
- abort_munmap_vmas(&mas_detach);
+ abort_munmap_vmas(&mas_detach, /* closed = */ false);
gather_failed:
validate_mm(mm);
return error;
@@ -1618,17 +1644,21 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
}
unsigned long count_vma_pages_range(struct mm_struct *mm,
- unsigned long addr, unsigned long end)
+ unsigned long addr, unsigned long end,
+ unsigned long *nr_accounted)
{
VMA_ITERATOR(vmi, mm, addr);
struct vm_area_struct *vma;
unsigned long nr_pages = 0;
+ *nr_accounted = 0;
for_each_vma_range(vmi, vma, end) {
unsigned long vm_start = max(addr, vma->vm_start);
unsigned long vm_end = min(end, vma->vm_end);
nr_pages += PHYS_PFN(vm_end - vm_start);
+ if (vma->vm_flags & VM_ACCOUNT)
+ *nr_accounted += PHYS_PFN(vm_end - vm_start);
}
return nr_pages;
diff --git a/mm/vma.h b/mm/vma.h
index 6028fdf79257..756dd42a6ec4 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -48,6 +48,8 @@ struct vma_munmap_struct {
unsigned long stack_vm;
unsigned long data_vm;
bool unlock; /* Unlock after the munmap */
+ bool clear_ptes; /* If there are outstanding PTE to be cleared */
+ bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */
};
#ifdef CONFIG_DEBUG_VM_MAPLE_TREE
@@ -95,14 +97,13 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
unsigned long start, unsigned long end, struct list_head *uf,
bool unlock)
{
+ vms->mm = current->mm;
vms->vmi = vmi;
vms->vma = vma;
if (vma) {
- vms->mm = vma->vm_mm;
vms->start = start;
vms->end = end;
} else {
- vms->mm = NULL;
vms->start = vms->end = 0;
}
vms->unlock = unlock;
@@ -112,6 +113,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
vms->unmap_start = FIRST_USER_ADDRESS;
vms->unmap_end = USER_PGTABLES_CEILING;
+ vms->clear_ptes = false;
+ vms->closed_vm_ops = false;
}
int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
@@ -120,18 +123,24 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach);
+void vms_clean_up_area(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach, bool mm_wr_locked);
+
/*
* abort_munmap_vmas - Undo any munmap work and free resources
*
* Reattach any detached vmas and free up the maple tree used to track the vmas.
*/
-static inline void abort_munmap_vmas(struct ma_state *mas_detach)
+static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
{
struct vm_area_struct *vma;
mas_set(mas_detach, 0);
- mas_for_each(mas_detach, vma, ULONG_MAX)
+ mas_for_each(mas_detach, vma, ULONG_MAX) {
vma_mark_detached(vma, false);
+ if (closed && vma->vm_ops && vma->vm_ops->open)
+ vma->vm_ops->open(vma);
+ }
__mt_destroy(mas_detach->tree);
}
@@ -145,7 +154,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
-void remove_vma(struct vm_area_struct *vma, bool unreachable);
+void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct vm_area_struct *next);
@@ -259,7 +268,8 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
int mm_take_all_locks(struct mm_struct *mm);
void mm_drop_all_locks(struct mm_struct *mm);
unsigned long count_vma_pages_range(struct mm_struct *mm,
- unsigned long addr, unsigned long end);
+ unsigned long addr, unsigned long end,
+ unsigned long *nr_accounted);
static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
{
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v6 14/20] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-08-20 23:57 ` [PATCH v6 14/20] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-08-21 11:02 ` Lorenzo Stoakes
2024-08-21 15:09 ` Liam R. Howlett
0 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2024-08-21 11:02 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
On Tue, Aug 20, 2024 at 07:57:23PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Instead of zeroing the vma tree and then overwriting the area, let the
> area be overwritten and then clean up the gathered vmas using
> vms_complete_munmap_vmas().
>
> To ensure locking is downgraded correctly, the mm is set regardless of
> MAP_FIXED or not (NULL vma).
>
> If a driver is mapping over an existing vma, then clear the ptes before
> the call_mmap() invocation. This is done using the vms_clean_up_area()
> helper. If there is a close vm_ops, that must also be called to ensure
> any cleanup is done before mapping over the area. This also means that
> calling open has been added to the abort of an unmap operation, for now.
Might be worth explicitly expanding this to say that this isn't a permanent
solution because of asymmetric vm_ops->open() / close().
>
> Temporarily keep track of the number of pages that will be removed and
> reduce the charged amount.
>
> This also drops the validate_mm() call in the vma_expand() function.
> It is necessary to drop the validate as it would fail since the mm
> map_count would be incorrect during a vma expansion, prior to the
> cleanup from vms_complete_munmap_vmas().
>
> Clean up the error handing of the vms_gather_munmap_vmas() by calling
> the verification within the function.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Broadly looks good, some nits and questions below, but generally:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/mmap.c | 62 ++++++++++++++++++++++++++-----------------------------
> mm/vma.c | 54 +++++++++++++++++++++++++++++++++++++-----------
> mm/vma.h | 22 ++++++++++++++------
> 3 files changed, 87 insertions(+), 51 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 71b2bad717b6..6550d9470d3a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1373,23 +1373,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> unsigned long merge_start = addr, merge_end = end;
> bool writable_file_mapping = false;
> pgoff_t vm_pgoff;
> - int error;
> + int error = -ENOMEM;
> VMA_ITERATOR(vmi, mm, addr);
> + unsigned long nr_pages, nr_accounted;
>
> - /* Check against address space limit. */
> - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> - unsigned long nr_pages;
> + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
>
> - /*
> - * MAP_FIXED may remove pages of mappings that intersects with
> - * requested mapping. Account for the pages it would unmap.
> - */
> - nr_pages = count_vma_pages_range(mm, addr, end);
> -
> - if (!may_expand_vm(mm, vm_flags,
> - (len >> PAGE_SHIFT) - nr_pages))
> - return -ENOMEM;
> - }
> + /*
> + * Check against address space limit.
> + * MAP_FIXED may remove pages of mappings that intersects with requested
> + * mapping. Account for the pages it would unmap.
> + */
> + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> + return -ENOMEM;
>
> /* Find the first overlapping VMA */
> vma = vma_find(&vmi, end);
> @@ -1400,14 +1396,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> /* Prepare to unmap any existing mapping in the area */
> if (vms_gather_munmap_vmas(&vms, &mas_detach))
> - goto gather_failed;
> -
> - /* Remove any existing mappings from the vma tree */
> - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> - goto clear_tree_failed;
> + return -ENOMEM;
>
> - /* Unmap any existing mapping in the area */
> - vms_complete_munmap_vmas(&vms, &mas_detach);
> next = vms.next;
> prev = vms.prev;
> vma = NULL;
> @@ -1423,8 +1413,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> */
> if (accountable_mapping(file, vm_flags)) {
> charged = len >> PAGE_SHIFT;
> + charged -= nr_accounted;
> if (security_vm_enough_memory_mm(mm, charged))
> - return -ENOMEM;
> + goto abort_munmap;
> + vms.nr_accounted = 0;
> vm_flags |= VM_ACCOUNT;
> }
>
> @@ -1473,10 +1465,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> * not unmapped, but the maps are removed from the list.
> */
> vma = vm_area_alloc(mm);
> - if (!vma) {
> - error = -ENOMEM;
> + if (!vma)
> goto unacct_error;
> - }
>
> vma_iter_config(&vmi, addr, end);
> vma_set_range(vma, addr, end, pgoff);
> @@ -1485,6 +1475,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>
> if (file) {
> vma->vm_file = get_file(file);
> + /*
> + * call_mmap() may map PTE, so ensure there are no existing PTEs
> + * call the vm_ops close function if one exists.
Super-nit, but maybe add an 'and' here.
> + */
> + vms_clean_up_area(&vms, &mas_detach, true);
I hate that we have to do this. These kind of hooks are the devil's works...
> error = call_mmap(file, vma);
> if (error)
> goto unmap_and_free_vma;
> @@ -1575,6 +1570,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> expanded:
> perf_event_mmap(vma);
>
> + /* Unmap any existing mapping in the area */
> + vms_complete_munmap_vmas(&vms, &mas_detach);
> +
> vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> if (vm_flags & VM_LOCKED) {
> if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> @@ -1603,7 +1601,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> return addr;
>
> close_and_free_vma:
> - if (file && vma->vm_ops && vma->vm_ops->close)
> + if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
> vma->vm_ops->close(vma);
>
> if (file || vma->vm_file) {
> @@ -1622,14 +1620,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> unacct_error:
> if (charged)
> vm_unacct_memory(charged);
> - validate_mm(mm);
> - return error;
>
> -clear_tree_failed:
> - abort_munmap_vmas(&mas_detach);
> -gather_failed:
> +abort_munmap:
> + if (vms.nr_pages)
> + abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
> validate_mm(mm);
> - return -ENOMEM;
> + return error;
> }
>
> static int __vm_munmap(unsigned long start, size_t len, bool unlock)
> @@ -1959,7 +1955,7 @@ void exit_mmap(struct mm_struct *mm)
> do {
> if (vma->vm_flags & VM_ACCOUNT)
> nr_accounted += vma_pages(vma);
> - remove_vma(vma, true);
> + remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
> count++;
> cond_resched();
> vma = vma_next(&vmi);
> diff --git a/mm/vma.c b/mm/vma.c
> index 7104c2c080bb..5b33f7460ab7 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -136,10 +136,10 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
> /*
> * Close a vm structure and free it.
> */
> -void remove_vma(struct vm_area_struct *vma, bool unreachable)
> +void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
> {
> might_sleep();
> - if (vma->vm_ops && vma->vm_ops->close)
> + if (!closed && vma->vm_ops && vma->vm_ops->close)
> vma->vm_ops->close(vma);
> if (vma->vm_file)
> fput(vma->vm_file);
> @@ -521,7 +521,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> vma_iter_store(vmi, vma);
>
> vma_complete(&vp, vmi, vma->vm_mm);
> - validate_mm(vma->vm_mm);
> return 0;
>
> nomem:
> @@ -645,11 +644,14 @@ void vma_complete(struct vma_prepare *vp,
> uprobe_mmap(vp->insert);
> }
>
> -static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> - struct ma_state *mas_detach, bool mm_wr_locked)
> +static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
> + struct ma_state *mas_detach, bool mm_wr_locked)
> {
> struct mmu_gather tlb;
>
> + if (!vms->clear_ptes) /* Nothing to do */
> + return;
> +
> /*
> * We can free page tables without write-locking mmap_lock because VMAs
> * were isolated before we downgraded mmap_lock.
> @@ -658,11 +660,31 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> lru_add_drain();
> tlb_gather_mmu(&tlb, vms->mm);
> update_hiwater_rss(vms->mm);
> - unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
> + unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end,
> + vms->vma_count, mm_wr_locked);
> +
> mas_set(mas_detach, 1);
> /* start and end may be different if there is no prev or next vma. */
> - free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
> + free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
> + vms->unmap_end, mm_wr_locked);
> tlb_finish_mmu(&tlb);
> + vms->clear_ptes = false;
> +}
> +
> +void vms_clean_up_area(struct vma_munmap_struct *vms,
> + struct ma_state *mas_detach, bool mm_wr_locked)
The only invocation of this function has mm_wr_locked set, is this
parameter necessary?
> +{
> + struct vm_area_struct *vma;
> +
> + if (!vms->nr_pages)
> + return;
> +
> + vms_clear_ptes(vms, mas_detach, mm_wr_locked);
> + mas_set(mas_detach, 0);
> + mas_for_each(mas_detach, vma, ULONG_MAX)
> + if (vma->vm_ops && vma->vm_ops->close)
> + vma->vm_ops->close(vma);
> + vms->closed_vm_ops = true;
> }
>
> /*
> @@ -686,7 +708,10 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> if (vms->unlock)
> mmap_write_downgrade(mm);
>
> - vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> + if (!vms->nr_pages)
> + return;
> +
> + vms_clear_ptes(vms, mas_detach, !vms->unlock);
> /* Update high watermark before we lower total_vm */
> update_hiwater_vm(mm);
> /* Stat accounting */
> @@ -697,7 +722,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> /* Remove and clean up vmas */
> mas_set(mas_detach, 0);
> mas_for_each(mas_detach, vma, ULONG_MAX)
> - remove_vma(vma, false);
> + remove_vma(vma, /* = */ false, vms->closed_vm_ops);
>
> vm_unacct_memory(vms->nr_accounted);
> validate_mm(mm);
> @@ -849,13 +874,14 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> while (vma_iter_addr(vms->vmi) > vms->start)
> vma_iter_prev_range(vms->vmi);
>
> + vms->clear_ptes = true;
> return 0;
>
> userfaultfd_error:
> munmap_gather_failed:
> end_split_failed:
> modify_vma_failed:
> - abort_munmap_vmas(mas_detach);
> + abort_munmap_vmas(mas_detach, /* closed = */ false);
> start_split_failed:
> map_count_exceeded:
> return error;
> @@ -900,7 +926,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> return 0;
>
> clear_tree_failed:
> - abort_munmap_vmas(&mas_detach);
> + abort_munmap_vmas(&mas_detach, /* closed = */ false);
> gather_failed:
> validate_mm(mm);
> return error;
> @@ -1618,17 +1644,21 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
> }
>
> unsigned long count_vma_pages_range(struct mm_struct *mm,
> - unsigned long addr, unsigned long end)
> + unsigned long addr, unsigned long end,
> + unsigned long *nr_accounted)
> {
> VMA_ITERATOR(vmi, mm, addr);
> struct vm_area_struct *vma;
> unsigned long nr_pages = 0;
>
> + *nr_accounted = 0;
> for_each_vma_range(vmi, vma, end) {
> unsigned long vm_start = max(addr, vma->vm_start);
> unsigned long vm_end = min(end, vma->vm_end);
>
> nr_pages += PHYS_PFN(vm_end - vm_start);
> + if (vma->vm_flags & VM_ACCOUNT)
> + *nr_accounted += PHYS_PFN(vm_end - vm_start);
Nitty, but maybe:
...
unsigned long pages = PHYS_PFN(vm_end - vm_start);
nr_pages += pages;
if (vma->vm_flags & VM_ACCOUNT)
*nr_accounted += pages;
> }
>
> return nr_pages;
> diff --git a/mm/vma.h b/mm/vma.h
> index 6028fdf79257..756dd42a6ec4 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -48,6 +48,8 @@ struct vma_munmap_struct {
> unsigned long stack_vm;
> unsigned long data_vm;
> bool unlock; /* Unlock after the munmap */
> + bool clear_ptes; /* If there are outstanding PTE to be cleared */
> + bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */
> };
>
> #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
> @@ -95,14 +97,13 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> unsigned long start, unsigned long end, struct list_head *uf,
> bool unlock)
> {
> + vms->mm = current->mm;
I'm guessing there are no circumstances under which we'd be looking at a
remote mm_struct?
This does sort of beg the question as to why we're bothering to store the
field if we can't just grab it from current->mm? Perhaps because the cache
line for the start of vms will be populated and current's containing ->mm
may not?
> vms->vmi = vmi;
> vms->vma = vma;
> if (vma) {
> - vms->mm = vma->vm_mm;
> vms->start = start;
> vms->end = end;
> } else {
> - vms->mm = NULL;
I guess as well there's no drawback to having an otherwise empty vms have a
populated mm?
> vms->start = vms->end = 0;
> }
> vms->unlock = unlock;
> @@ -112,6 +113,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> vms->unmap_start = FIRST_USER_ADDRESS;
> vms->unmap_end = USER_PGTABLES_CEILING;
> + vms->clear_ptes = false;
> + vms->closed_vm_ops = false;
> }
>
> int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> @@ -120,18 +123,24 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> struct ma_state *mas_detach);
>
> +void vms_clean_up_area(struct vma_munmap_struct *vms,
> + struct ma_state *mas_detach, bool mm_wr_locked);
> +
> /*
> * abort_munmap_vmas - Undo any munmap work and free resources
> *
> * Reattach any detached vmas and free up the maple tree used to track the vmas.
> */
> -static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> +static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> {
> struct vm_area_struct *vma;
>
> mas_set(mas_detach, 0);
> - mas_for_each(mas_detach, vma, ULONG_MAX)
> + mas_for_each(mas_detach, vma, ULONG_MAX) {
> vma_mark_detached(vma, false);
> + if (closed && vma->vm_ops && vma->vm_ops->open)
> + vma->vm_ops->open(vma);
> + }
Hang on, I thought we eliminated this approach? OK I see you change this in
the next commit.
Not necessarily a huge fan of having a commit in the tree that's broken for
(hideous, asymmetric) drivers + such but I guess it's okay given we address
it immediately and it helps document the thinking process + split up the
code.
>
> __mt_destroy(mas_detach->tree);
> }
> @@ -145,7 +154,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> unsigned long start, size_t len, struct list_head *uf,
> bool unlock);
>
> -void remove_vma(struct vm_area_struct *vma, bool unreachable);
> +void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
>
> void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> struct vm_area_struct *prev, struct vm_area_struct *next);
> @@ -259,7 +268,8 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
> int mm_take_all_locks(struct mm_struct *mm);
> void mm_drop_all_locks(struct mm_struct *mm);
> unsigned long count_vma_pages_range(struct mm_struct *mm,
> - unsigned long addr, unsigned long end);
> + unsigned long addr, unsigned long end,
> + unsigned long *nr_accounted);
>
> static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
> {
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v6 14/20] mm/mmap: Avoid zeroing vma tree in mmap_region()
2024-08-21 11:02 ` Lorenzo Stoakes
@ 2024-08-21 15:09 ` Liam R. Howlett
0 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-21 15:09 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240821 07:02]:
> On Tue, Aug 20, 2024 at 07:57:23PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Instead of zeroing the vma tree and then overwriting the area, let the
> > area be overwritten and then clean up the gathered vmas using
> > vms_complete_munmap_vmas().
> >
> > To ensure locking is downgraded correctly, the mm is set regardless of
> > MAP_FIXED or not (NULL vma).
> >
> > If a driver is mapping over an existing vma, then clear the ptes before
> > the call_mmap() invocation. This is done using the vms_clean_up_area()
> > helper. If there is a close vm_ops, that must also be called to ensure
> > any cleanup is done before mapping over the area. This also means that
> > calling open has been added to the abort of an unmap operation, for now.
>
> Might be worth explicitly expanding this to say that this isn't a permanent
> solution because of asymmetric vm_ops->open() / close().
Yes, I will expand it.
>
> >
> > Temporarily keep track of the number of pages that will be removed and
> > reduce the charged amount.
> >
> > This also drops the validate_mm() call in the vma_expand() function.
> > It is necessary to drop the validate as it would fail since the mm
> > map_count would be incorrect during a vma expansion, prior to the
> > cleanup from vms_complete_munmap_vmas().
> >
> > Clean up the error handing of the vms_gather_munmap_vmas() by calling
> > the verification within the function.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
>
> Broadly looks good, some nits and questions below, but generally:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> > mm/mmap.c | 62 ++++++++++++++++++++++++++-----------------------------
> > mm/vma.c | 54 +++++++++++++++++++++++++++++++++++++-----------
> > mm/vma.h | 22 ++++++++++++++------
> > 3 files changed, 87 insertions(+), 51 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 71b2bad717b6..6550d9470d3a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1373,23 +1373,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > unsigned long merge_start = addr, merge_end = end;
> > bool writable_file_mapping = false;
> > pgoff_t vm_pgoff;
> > - int error;
> > + int error = -ENOMEM;
> > VMA_ITERATOR(vmi, mm, addr);
> > + unsigned long nr_pages, nr_accounted;
> >
> > - /* Check against address space limit. */
> > - if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
> > - unsigned long nr_pages;
> > + nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> >
> > - /*
> > - * MAP_FIXED may remove pages of mappings that intersects with
> > - * requested mapping. Account for the pages it would unmap.
> > - */
> > - nr_pages = count_vma_pages_range(mm, addr, end);
> > -
> > - if (!may_expand_vm(mm, vm_flags,
> > - (len >> PAGE_SHIFT) - nr_pages))
> > - return -ENOMEM;
> > - }
> > + /*
> > + * Check against address space limit.
> > + * MAP_FIXED may remove pages of mappings that intersects with requested
> > + * mapping. Account for the pages it would unmap.
> > + */
> > + if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
> > + return -ENOMEM;
> >
> > /* Find the first overlapping VMA */
> > vma = vma_find(&vmi, end);
> > @@ -1400,14 +1396,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > /* Prepare to unmap any existing mapping in the area */
> > if (vms_gather_munmap_vmas(&vms, &mas_detach))
> > - goto gather_failed;
> > -
> > - /* Remove any existing mappings from the vma tree */
> > - if (vma_iter_clear_gfp(&vmi, addr, end, GFP_KERNEL))
> > - goto clear_tree_failed;
> > + return -ENOMEM;
> >
> > - /* Unmap any existing mapping in the area */
> > - vms_complete_munmap_vmas(&vms, &mas_detach);
> > next = vms.next;
> > prev = vms.prev;
> > vma = NULL;
> > @@ -1423,8 +1413,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > */
> > if (accountable_mapping(file, vm_flags)) {
> > charged = len >> PAGE_SHIFT;
> > + charged -= nr_accounted;
> > if (security_vm_enough_memory_mm(mm, charged))
> > - return -ENOMEM;
> > + goto abort_munmap;
> > + vms.nr_accounted = 0;
> > vm_flags |= VM_ACCOUNT;
> > }
> >
> > @@ -1473,10 +1465,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > * not unmapped, but the maps are removed from the list.
> > */
> > vma = vm_area_alloc(mm);
> > - if (!vma) {
> > - error = -ENOMEM;
> > + if (!vma)
> > goto unacct_error;
> > - }
> >
> > vma_iter_config(&vmi, addr, end);
> > vma_set_range(vma, addr, end, pgoff);
> > @@ -1485,6 +1475,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >
> > if (file) {
> > vma->vm_file = get_file(file);
> > + /*
> > + * call_mmap() may map PTE, so ensure there are no existing PTEs
> > + * call the vm_ops close function if one exists.
>
> Super-nit, but maybe add an 'and' here.
I'm re-spinning anyways - thanks.
>
> > + */
> > + vms_clean_up_area(&vms, &mas_detach, true);
>
> I hate that we have to do this. These kind of hooks are the devil's works...
>
> > error = call_mmap(file, vma);
> > if (error)
> > goto unmap_and_free_vma;
> > @@ -1575,6 +1570,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > expanded:
> > perf_event_mmap(vma);
> >
> > + /* Unmap any existing mapping in the area */
> > + vms_complete_munmap_vmas(&vms, &mas_detach);
> > +
> > vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
> > if (vm_flags & VM_LOCKED) {
> > if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
> > @@ -1603,7 +1601,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > return addr;
> >
> > close_and_free_vma:
> > - if (file && vma->vm_ops && vma->vm_ops->close)
> > + if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
> > vma->vm_ops->close(vma);
> >
> > if (file || vma->vm_file) {
> > @@ -1622,14 +1620,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > unacct_error:
> > if (charged)
> > vm_unacct_memory(charged);
> > - validate_mm(mm);
> > - return error;
> >
> > -clear_tree_failed:
> > - abort_munmap_vmas(&mas_detach);
> > -gather_failed:
> > +abort_munmap:
> > + if (vms.nr_pages)
> > + abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
> > validate_mm(mm);
> > - return -ENOMEM;
> > + return error;
> > }
> >
> > static int __vm_munmap(unsigned long start, size_t len, bool unlock)
> > @@ -1959,7 +1955,7 @@ void exit_mmap(struct mm_struct *mm)
> > do {
> > if (vma->vm_flags & VM_ACCOUNT)
> > nr_accounted += vma_pages(vma);
> > - remove_vma(vma, true);
> > + remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
> > count++;
> > cond_resched();
> > vma = vma_next(&vmi);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 7104c2c080bb..5b33f7460ab7 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -136,10 +136,10 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
> > /*
> > * Close a vm structure and free it.
> > */
> > -void remove_vma(struct vm_area_struct *vma, bool unreachable)
> > +void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
> > {
> > might_sleep();
> > - if (vma->vm_ops && vma->vm_ops->close)
> > + if (!closed && vma->vm_ops && vma->vm_ops->close)
> > vma->vm_ops->close(vma);
> > if (vma->vm_file)
> > fput(vma->vm_file);
> > @@ -521,7 +521,6 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > vma_iter_store(vmi, vma);
> >
> > vma_complete(&vp, vmi, vma->vm_mm);
> > - validate_mm(vma->vm_mm);
> > return 0;
> >
> > nomem:
> > @@ -645,11 +644,14 @@ void vma_complete(struct vma_prepare *vp,
> > uprobe_mmap(vp->insert);
> > }
> >
> > -static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > - struct ma_state *mas_detach, bool mm_wr_locked)
> > +static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
> > + struct ma_state *mas_detach, bool mm_wr_locked)
> > {
> > struct mmu_gather tlb;
> >
> > + if (!vms->clear_ptes) /* Nothing to do */
> > + return;
> > +
> > /*
> > * We can free page tables without write-locking mmap_lock because VMAs
> > * were isolated before we downgraded mmap_lock.
> > @@ -658,11 +660,31 @@ static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > lru_add_drain();
> > tlb_gather_mmu(&tlb, vms->mm);
> > update_hiwater_rss(vms->mm);
> > - unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
> > + unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end,
> > + vms->vma_count, mm_wr_locked);
> > +
> > mas_set(mas_detach, 1);
> > /* start and end may be different if there is no prev or next vma. */
> > - free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
> > + free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
> > + vms->unmap_end, mm_wr_locked);
> > tlb_finish_mmu(&tlb);
> > + vms->clear_ptes = false;
> > +}
> > +
> > +void vms_clean_up_area(struct vma_munmap_struct *vms,
> > + struct ma_state *mas_detach, bool mm_wr_locked)
>
> The only invocation of this function has mm_wr_locked set; is this
> parameter necessary?
I'll remove this, and replace the in-function pass-through with the
constant true.
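Roughly something like this (just a sketch of the intent, not necessarily
the exact code that will land):

void vms_clean_up_area(struct vma_munmap_struct *vms,
		struct ma_state *mas_detach)
{
	struct vm_area_struct *vma;

	if (!vms->nr_pages)
		return;

	/* The mmap lock is always write-held here, so pass true directly */
	vms_clear_ptes(vms, mas_detach, /* mm_wr_locked = */ true);
	mas_set(mas_detach, 0);
	mas_for_each(mas_detach, vma, ULONG_MAX)
		if (vma->vm_ops && vma->vm_ops->close)
			vma->vm_ops->close(vma);
	vms->closed_vm_ops = true;
}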
>
> > +{
> > + struct vm_area_struct *vma;
> > +
> > + if (!vms->nr_pages)
> > + return;
> > +
> > + vms_clear_ptes(vms, mas_detach, mm_wr_locked);
> > + mas_set(mas_detach, 0);
> > + mas_for_each(mas_detach, vma, ULONG_MAX)
> > + if (vma->vm_ops && vma->vm_ops->close)
> > + vma->vm_ops->close(vma);
> > + vms->closed_vm_ops = true;
> > }
> >
> > /*
> > @@ -686,7 +708,10 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > if (vms->unlock)
> > mmap_write_downgrade(mm);
> >
> > - vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> > + if (!vms->nr_pages)
> > + return;
> > +
> > + vms_clear_ptes(vms, mas_detach, !vms->unlock);
> > /* Update high watermark before we lower total_vm */
> > update_hiwater_vm(mm);
> > /* Stat accounting */
> > @@ -697,7 +722,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > /* Remove and clean up vmas */
> > mas_set(mas_detach, 0);
> > mas_for_each(mas_detach, vma, ULONG_MAX)
> > - remove_vma(vma, false);
> > + remove_vma(vma, /* = */ false, vms->closed_vm_ops);
> >
> > vm_unacct_memory(vms->nr_accounted);
> > validate_mm(mm);
> > @@ -849,13 +874,14 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > while (vma_iter_addr(vms->vmi) > vms->start)
> > vma_iter_prev_range(vms->vmi);
> >
> > + vms->clear_ptes = true;
> > return 0;
> >
> > userfaultfd_error:
> > munmap_gather_failed:
> > end_split_failed:
> > modify_vma_failed:
> > - abort_munmap_vmas(mas_detach);
> > + abort_munmap_vmas(mas_detach, /* closed = */ false);
> > start_split_failed:
> > map_count_exceeded:
> > return error;
> > @@ -900,7 +926,7 @@ int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > return 0;
> >
> > clear_tree_failed:
> > - abort_munmap_vmas(&mas_detach);
> > + abort_munmap_vmas(&mas_detach, /* closed = */ false);
> > gather_failed:
> > validate_mm(mm);
> > return error;
> > @@ -1618,17 +1644,21 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
> > }
> >
> > unsigned long count_vma_pages_range(struct mm_struct *mm,
> > - unsigned long addr, unsigned long end)
> > + unsigned long addr, unsigned long end,
> > + unsigned long *nr_accounted)
> > {
> > VMA_ITERATOR(vmi, mm, addr);
> > struct vm_area_struct *vma;
> > unsigned long nr_pages = 0;
> >
> > + *nr_accounted = 0;
> > for_each_vma_range(vmi, vma, end) {
> > unsigned long vm_start = max(addr, vma->vm_start);
> > unsigned long vm_end = min(end, vma->vm_end);
> >
> > nr_pages += PHYS_PFN(vm_end - vm_start);
> > + if (vma->vm_flags & VM_ACCOUNT)
> > + *nr_accounted += PHYS_PFN(vm_end - vm_start);
>
> Nitty, but maybe:
> ...
> unsigned long pages = PHYS_PFN(vm_end - vm_start);
>
> nr_pages += pages;
> if (vma->vm_flags & VM_ACCOUNT)
> *nr_accounted += pages;
>
This is a temporary state as count_vma_pages_range() is removed shortly
after.
> > }
> >
> > return nr_pages;
> > diff --git a/mm/vma.h b/mm/vma.h
> > index 6028fdf79257..756dd42a6ec4 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -48,6 +48,8 @@ struct vma_munmap_struct {
> > unsigned long stack_vm;
> > unsigned long data_vm;
> > bool unlock; /* Unlock after the munmap */
> > + bool clear_ptes; /* If there are outstanding PTE to be cleared */
> > + bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */
> > };
> >
> > #ifdef CONFIG_DEBUG_VM_MAPLE_TREE
> > @@ -95,14 +97,13 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > unsigned long start, unsigned long end, struct list_head *uf,
> > bool unlock)
> > {
> > + vms->mm = current->mm;
>
> I'm guessing there are no circumstances under which we'd be looking at a
> remote mm_struct?
Yes, we always unmap things from the mm that contains them.
>
> This does sort of beg the question as to why we're bothering to store the
> field when we could just grab it from current->mm? Perhaps because the cache
> line for the start of vms will be populated and current's containing ->mm
> may not?
If I don't do this here, then vms_clear_ptes(),
vms_gather_munmap_vmas(), and vms_complete_munmap_vmas() will need this.
It's not critical, but it means we're doing it twice instead of putting
it in the struct that is already used in those functions.
Actually, I think I will drop the field so that I can reduce the
cachelines used by vma_munmap_struct through this reduction and some
rearranging.
I can use vms->vma->vm_mm for vms_clear_ptes() and the gather case.
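vms_clear_ptes() would then look something like this (sketch only; the
actual respin may differ):

static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
		struct ma_state *mas_detach, bool mm_wr_locked)
{
	struct mmu_gather tlb;

	if (!vms->clear_ptes) /* Nothing to do */
		return;

	/*
	 * We can free page tables without write-locking mmap_lock because
	 * VMAs were isolated before we downgraded mmap_lock.
	 */
	mas_set(mas_detach, 1);
	lru_add_drain();
	/* No stored vms->mm; take the mm from the first vma instead */
	tlb_gather_mmu(&tlb, vms->vma->vm_mm);
	update_hiwater_rss(vms->vma->vm_mm);
	unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end,
		   vms->vma_count, mm_wr_locked);

	mas_set(mas_detach, 1);
	/* start and end may be different if there is no prev or next vma. */
	free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start,
		      vms->unmap_end, mm_wr_locked);
	tlb_finish_mmu(&tlb);
	vms->clear_ptes = false;
}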
>
> > vms->vmi = vmi;
> > vms->vma = vma;
> > if (vma) {
> > - vms->mm = vma->vm_mm;
> > vms->start = start;
> > vms->end = end;
> > } else {
> > - vms->mm = NULL;
>
> I guess as well there's no drawback to having an otherwise empty vms have a
> populated mm?
It's actually needed because we may need to downgrade the mm lock, so
having it set makes vms_complete_munmap_vmas() cleaner.
But I'm removing it, so it doesn't really matter.
>
> > vms->start = vms->end = 0;
> > }
> > vms->unlock = unlock;
> > @@ -112,6 +113,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > vms->unmap_start = FIRST_USER_ADDRESS;
> > vms->unmap_end = USER_PGTABLES_CEILING;
> > + vms->clear_ptes = false;
> > + vms->closed_vm_ops = false;
> > }
> >
> > int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > @@ -120,18 +123,24 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > struct ma_state *mas_detach);
> >
> > +void vms_clean_up_area(struct vma_munmap_struct *vms,
> > + struct ma_state *mas_detach, bool mm_wr_locked);
> > +
> > /*
> > * abort_munmap_vmas - Undo any munmap work and free resources
> > *
> > * Reattach any detached vmas and free up the maple tree used to track the vmas.
> > */
> > -static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> > +static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> > {
> > struct vm_area_struct *vma;
> >
> > mas_set(mas_detach, 0);
> > - mas_for_each(mas_detach, vma, ULONG_MAX)
> > + mas_for_each(mas_detach, vma, ULONG_MAX) {
> > vma_mark_detached(vma, false);
> > + if (closed && vma->vm_ops && vma->vm_ops->open)
> > + vma->vm_ops->open(vma);
> > + }
>
> Hang on, I thought we eliminated this approach? OK I see you change this in
> the next commmit.
>
> Not necessarily a huge fan of having a commit in the tree that's broken for
> (hideous, asymmetric) drivers + such but I guess it's okay given we address
> it immediately and it helps document the thinking process + split up the
> code.
Yes, this is really to show where things would happen - and for this to
be an issue with bisection, one would have to test the failure paths and
hit a case where the failure would cause issues if the vma was closed
and re-opened. I think this is sufficiently rare.
>
> >
> > __mt_destroy(mas_detach->tree);
> > }
> > @@ -145,7 +154,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> > unsigned long start, size_t len, struct list_head *uf,
> > bool unlock);
> >
> > -void remove_vma(struct vm_area_struct *vma, bool unreachable);
> > +void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
> >
> > void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
> > struct vm_area_struct *prev, struct vm_area_struct *next);
> > @@ -259,7 +268,8 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
> > int mm_take_all_locks(struct mm_struct *mm);
> > void mm_drop_all_locks(struct mm_struct *mm);
> > unsigned long count_vma_pages_range(struct mm_struct *mm,
> > - unsigned long addr, unsigned long end);
> > + unsigned long addr, unsigned long end,
> > + unsigned long *nr_accounted);
> >
> > static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
> > {
> > --
> > 2.43.0
> >
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v6 15/20] mm: Change failure of MAP_FIXED to restoring the gap on failure
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (13 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 14/20] mm/mmap: Avoid zeroing vma tree in mmap_region() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-21 11:56 ` Lorenzo Stoakes
2024-08-20 23:57 ` [PATCH v6 16/20] mm/mmap: Use PHYS_PFN in mmap_region() Liam R. Howlett
` (4 subsequent siblings)
19 siblings, 1 reply; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Prior to call_mmap(), the vmas that will be replaced need to clear the
way for what may happen in the call_mmap(). This clean up work includes
clearing the ptes and calling the close() vm_ops. Some users do more
setup than can be restored by calling the vm_ops open() function. It is
safer to store the gap in the vma tree in these cases.
That is to say that the failure scenario that existed before the
MAP_FIXED gap exposure is restored as it is safer than trying to undo a
partial mapping.
There is also a secondary failure that may occur if there is not enough
memory to store the gap. In this case, the vmas are reattached and
resources freed. If the system cannot complete the call_mmap() and
fails to allocate with GFP_KERNEL, then the system will print a warning
about the failure.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
mm/mmap.c | 3 +--
mm/vma.h | 62 +++++++++++++++++++++++++++++++++++++------------------
2 files changed, 43 insertions(+), 22 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 6550d9470d3a..c1b3d17f97be 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1622,8 +1622,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_unacct_memory(charged);
abort_munmap:
- if (vms.nr_pages)
- abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
+ vms_abort_munmap_vmas(&vms, &mas_detach, vms.closed_vm_ops);
validate_mm(mm);
return error;
}
diff --git a/mm/vma.h b/mm/vma.h
index 756dd42a6ec4..7618ddbfd2b2 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long start, unsigned long end, pgoff_t pgoff);
+static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
+ struct vm_area_struct *vma, gfp_t gfp)
+
+{
+ if (vmi->mas.status != ma_start &&
+ ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
+ vma_iter_invalidate(vmi);
+
+ __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
+ mas_store_gfp(&vmi->mas, vma, gfp);
+ if (unlikely(mas_is_err(&vmi->mas)))
+ return -ENOMEM;
+
+ return 0;
+}
+
/*
* init_vma_munmap() - Initializer wrapper for vma_munmap_struct
* @vms: The vma munmap struct
@@ -136,15 +152,37 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
struct vm_area_struct *vma;
mas_set(mas_detach, 0);
- mas_for_each(mas_detach, vma, ULONG_MAX) {
+ mas_for_each(mas_detach, vma, ULONG_MAX)
vma_mark_detached(vma, false);
- if (closed && vma->vm_ops && vma->vm_ops->open)
- vma->vm_ops->open(vma);
- }
__mt_destroy(mas_detach->tree);
}
+static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach, bool closed)
+{
+ if (!vms->nr_pages)
+ return;
+
+ if (vms->clear_ptes)
+ return abort_munmap_vmas(mas_detach, vms->closed_vm_ops);
+
+ /*
+ * Aborting cannot just call the vm_ops open() because they are often
+ * not symmetrical and state data has been lost. Resort to the old
+ * failure method of leaving a gap where the MAP_FIXED mapping failed.
+ */
+ if (unlikely(vma_iter_store_gfp(vms->vmi, NULL, GFP_KERNEL))) {
+ pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
+ current->comm, current->pid);
+ /* Leaving vmas detached and in-tree may hamper recovery */
+ abort_munmap_vmas(mas_detach, vms->closed_vm_ops);
+ } else {
+ /* Clean up the insertion of unfortunate the gap */
+ vms_complete_munmap_vmas(vms, mas_detach);
+ }
+}
+
int
do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
@@ -297,22 +335,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
return mas_prev(&vmi->mas, min);
}
-static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
- struct vm_area_struct *vma, gfp_t gfp)
-{
- if (vmi->mas.status != ma_start &&
- ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
- vma_iter_invalidate(vmi);
-
- __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
- mas_store_gfp(&vmi->mas, vma, gfp);
- if (unlikely(mas_is_err(&vmi->mas)))
- return -ENOMEM;
-
- return 0;
-}
-
-
/*
* These three helpers classifies VMAs for virtual memory accounting.
*/
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: [PATCH v6 15/20] mm: Change failure of MAP_FIXED to restoring the gap on failure
2024-08-20 23:57 ` [PATCH v6 15/20] mm: Change failure of MAP_FIXED to restoring the gap on failure Liam R. Howlett
@ 2024-08-21 11:56 ` Lorenzo Stoakes
2024-08-21 13:31 ` Liam R. Howlett
0 siblings, 1 reply; 32+ messages in thread
From: Lorenzo Stoakes @ 2024-08-21 11:56 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
On Tue, Aug 20, 2024 at 07:57:24PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Prior to call_mmap(), the vmas that will be replaced need to clear the
> way for what may happen in the call_mmap(). This clean up work includes
> clearing the ptes and calling the close() vm_ops. Some users do more
> setup than can be restored by calling the vm_ops open() function. It is
> safer to store the gap in the vma tree in these cases.
>
> That is to say that the failure scenario that existed before the
> MAP_FIXED gap exposure is restored as it is safer than trying to undo a
> partial mapping.
>
> There is also a secondary failure that may occur if there is not enough
> memory to store the gap. In this case, the vmas are reattached and
> resources freed. If the system cannot complete the call_mmap() and
> fails to allocate with GFP_KERNEL, then the system will print a warning
> about the failure.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Just some nits etc. below, otherwise:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/mmap.c | 3 +--
> mm/vma.h | 62 +++++++++++++++++++++++++++++++++++++------------------
> 2 files changed, 43 insertions(+), 22 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 6550d9470d3a..c1b3d17f97be 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1622,8 +1622,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> vm_unacct_memory(charged);
>
> abort_munmap:
> - if (vms.nr_pages)
> - abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
> + vms_abort_munmap_vmas(&vms, &mas_detach, vms.closed_vm_ops);
> validate_mm(mm);
> return error;
> }
> diff --git a/mm/vma.h b/mm/vma.h
> index 756dd42a6ec4..7618ddbfd2b2 100644
> --- a/mm/vma.h
> +++ b/mm/vma.h
> @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> unsigned long start, unsigned long end, pgoff_t pgoff);
>
> +static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> + struct vm_area_struct *vma, gfp_t gfp)
> +
> +{
> + if (vmi->mas.status != ma_start &&
> + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> + vma_iter_invalidate(vmi);
> +
> + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> + mas_store_gfp(&vmi->mas, vma, gfp);
> + if (unlikely(mas_is_err(&vmi->mas)))
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> /*
> * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> * @vms: The vma munmap struct
> @@ -136,15 +152,37 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> struct vm_area_struct *vma;
>
> mas_set(mas_detach, 0);
> - mas_for_each(mas_detach, vma, ULONG_MAX) {
> + mas_for_each(mas_detach, vma, ULONG_MAX)
> vma_mark_detached(vma, false);
> - if (closed && vma->vm_ops && vma->vm_ops->open)
> - vma->vm_ops->open(vma);
> - }
Yes.
>
> __mt_destroy(mas_detach->tree);
> }
>
> +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> + struct ma_state *mas_detach, bool closed)
Nitty, but seems strange to comment abort_munmap_vmas() and say that it
undoes any munmap work + frees resources, but not to comment this function.
Also I wonder if, with these changes, abort_munmap_vmas() needs a rename?
Something like reattach_vmas() or something?
> +{
> + if (!vms->nr_pages)
> + return;
> +
> + if (vms->clear_ptes)
> + return abort_munmap_vmas(mas_detach, vms->closed_vm_ops);
> +
> + /*
> + * Aborting cannot just call the vm_ops open() because they are often
> + * not symmetrical and state data has been lost. Resort to the old
> + * failure method of leaving a gap where the MAP_FIXED mapping failed.
> + */
> + if (unlikely(vma_iter_store_gfp(vms->vmi, NULL, GFP_KERNEL))) {
> + pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
> + current->comm, current->pid);
> + /* Leaving vmas detached and in-tree may hamper recovery */
> + abort_munmap_vmas(mas_detach, vms->closed_vm_ops);
OK so I guess rather than trying to do some clever preallocation, we just
warn + get out?
> + } else {
> + /* Clean up the insertion of unfortunate the gap */
I'm not sure what this means :P 'unfortunate, the gap, isn't it?'
> + vms_complete_munmap_vmas(vms, mas_detach);
> + }
> +}
> +
> int
> do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> struct mm_struct *mm, unsigned long start,
> @@ -297,22 +335,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
> return mas_prev(&vmi->mas, min);
> }
>
> -static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> - struct vm_area_struct *vma, gfp_t gfp)
> -{
> - if (vmi->mas.status != ma_start &&
> - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> - vma_iter_invalidate(vmi);
> -
> - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> - mas_store_gfp(&vmi->mas, vma, gfp);
> - if (unlikely(mas_is_err(&vmi->mas)))
> - return -ENOMEM;
> -
> - return 0;
> -}
> -
> -
> /*
> * These three helpers classifies VMAs for virtual memory accounting.
> */
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: [PATCH v6 15/20] mm: Change failure of MAP_FIXED to restoring the gap on failure
2024-08-21 11:56 ` Lorenzo Stoakes
@ 2024-08-21 13:31 ` Liam R. Howlett
0 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-21 13:31 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [240821 07:56]:
> On Tue, Aug 20, 2024 at 07:57:24PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Prior to call_mmap(), the vmas that will be replaced need to clear the
> > way for what may happen in the call_mmap(). This clean up work includes
> > clearing the ptes and calling the close() vm_ops. Some users do more
> > setup than can be restored by calling the vm_ops open() function. It is
> > safer to store the gap in the vma tree in these cases.
> >
> > That is to say that the failure scenario that existed before the
> > MAP_FIXED gap exposure is restored as it is safer than trying to undo a
> > partial mapping.
> >
> > There is also a secondary failure that may occur if there is not enough
> > memory to store the gap. In this case, the vmas are reattached and
> > resources freed. If the system cannot complete the call_mmap() and
> > fails to allocate with GFP_KERNEL, then the system will print a warning
> > about the failure.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
>
> Just some nits etc. below, otherwise:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > ---
> > mm/mmap.c | 3 +--
> > mm/vma.h | 62 +++++++++++++++++++++++++++++++++++++------------------
> > 2 files changed, 43 insertions(+), 22 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 6550d9470d3a..c1b3d17f97be 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1622,8 +1622,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vm_unacct_memory(charged);
> >
> > abort_munmap:
> > - if (vms.nr_pages)
> > - abort_munmap_vmas(&mas_detach, vms.closed_vm_ops);
> > + vms_abort_munmap_vmas(&vms, &mas_detach, vms.closed_vm_ops);
> > validate_mm(mm);
> > return error;
> > }
> > diff --git a/mm/vma.h b/mm/vma.h
> > index 756dd42a6ec4..7618ddbfd2b2 100644
> > --- a/mm/vma.h
> > +++ b/mm/vma.h
> > @@ -82,6 +82,22 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > unsigned long start, unsigned long end, pgoff_t pgoff);
> >
> > +static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> > + struct vm_area_struct *vma, gfp_t gfp)
> > +
> > +{
> > + if (vmi->mas.status != ma_start &&
> > + ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> > + vma_iter_invalidate(vmi);
> > +
> > + __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > + mas_store_gfp(&vmi->mas, vma, gfp);
> > + if (unlikely(mas_is_err(&vmi->mas)))
> > + return -ENOMEM;
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
> > * @vms: The vma munmap struct
> > @@ -136,15 +152,37 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach, bool closed)
> > struct vm_area_struct *vma;
> >
> > mas_set(mas_detach, 0);
> > - mas_for_each(mas_detach, vma, ULONG_MAX) {
> > + mas_for_each(mas_detach, vma, ULONG_MAX)
> > vma_mark_detached(vma, false);
> > - if (closed && vma->vm_ops && vma->vm_ops->open)
> > - vma->vm_ops->open(vma);
> > - }
>
> Yes.
Woops, closed isn't used anymore here.
>
> >
> > __mt_destroy(mas_detach->tree);
> > }
> >
> > +static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
> > + struct ma_state *mas_detach, bool closed)
>
> Nitty, but seems strange to comment abort_munmap_vmas() and say that it
> undoes any munmap work + frees resources, but not to comment this function.
Sure, I can add a comment; it's probably more useful than the one above.
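Something along these lines, maybe:

/*
 * vms_abort_munmap_vmas() - Undo as much as possible from a failed munmap()
 * operation.
 * @vms: The vma unmap structure
 * @mas_detach: The maple state with the detached maple tree
 *
 * Reattach any detached vmas, or leave a gap in the vma tree if the area
 * has already been cleaned up and cannot safely be re-opened.
 */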
>
> Also I wonder if, with these changes, abort_munmap_vmas() needs a rename?
> Something like reattach_vmas() or something?
I can rename it.  It is now used exclusively in vma.c, apart from the
use below.
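i.e. roughly this, dropping the now-unused closed argument as well:

/*
 * reattach_vmas() - Undo any munmap work and free resources
 * @mas_detach: The maple state with the detached maple tree
 *
 * Reattach any detached vmas and free up the maple tree used to track them.
 */
static inline void reattach_vmas(struct ma_state *mas_detach)
{
	struct vm_area_struct *vma;

	mas_set(mas_detach, 0);
	mas_for_each(mas_detach, vma, ULONG_MAX)
		vma_mark_detached(vma, false);

	__mt_destroy(mas_detach->tree);
}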
>
> > +{
> > + if (!vms->nr_pages)
> > + return;
> > +
> > + if (vms->clear_ptes)
> > + return abort_munmap_vmas(mas_detach, vms->closed_vm_ops);
> > +
> > + /*
> > + * Aborting cannot just call the vm_ops open() because they are often
> > + * not symmetrical and state data has been lost. Resort to the old
> > + * failure method of leaving a gap where the MAP_FIXED mapping failed.
> > + */
> > + if (unlikely(vma_iter_store_gfp(vms->vmi, NULL, GFP_KERNEL))) {
> > + pr_warn_once("%s: (%d) Unable to abort munmap() operation\n",
> > + current->comm, current->pid);
> > + /* Leaving vmas detached and in-tree may hamper recovery */
> > + abort_munmap_vmas(mas_detach, vms->closed_vm_ops);
>
> OK so I guess rather than trying to do some clever preallocation, we just
> warn + get out?
Hmm, clever preallocations... I could maybe figure out something here.
No double failure left unfixed!
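Something like this before call_mmap() might do it (untested idea, names
as in mmap_region(); I'd have to check that the maple tree preallocation
handles a spanning NULL store the way I expect):

	/*
	 * Idea: preallocate the nodes needed to store the gap (NULL) over
	 * the munmap range before call_mmap(), so the abort path cannot
	 * fail on allocation.
	 */
	__mas_set_range(&vmi.mas, vms.start, vms.end - 1);
	if (mas_preallocate(&vmi.mas, NULL, GFP_KERNEL)) {
		error = -ENOMEM;
		goto abort_munmap;
	}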
>
> > + } else {
> > + /* Clean up the insertion of unfortunate the gap */
>
> I'm not sure what this means :P 'unfortunate, the gap, isn't it?'
Right. Perhaps I should update this to 'mind the gap'.
>
> > + vms_complete_munmap_vmas(vms, mas_detach);
> > + }
> > +}
> > +
> > int
> > do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > struct mm_struct *mm, unsigned long start,
> > @@ -297,22 +335,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
> > return mas_prev(&vmi->mas, min);
> > }
> >
> > -static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
> > - struct vm_area_struct *vma, gfp_t gfp)
> > -{
> > - if (vmi->mas.status != ma_start &&
> > - ((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
> > - vma_iter_invalidate(vmi);
> > -
> > - __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
> > - mas_store_gfp(&vmi->mas, vma, gfp);
> > - if (unlikely(mas_is_err(&vmi->mas)))
> > - return -ENOMEM;
> > -
> > - return 0;
> > -}
> > -
> > -
> > /*
> > * These three helpers classifies VMAs for virtual memory accounting.
> > */
> > --
> > 2.43.0
> >
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v6 16/20] mm/mmap: Use PHYS_PFN in mmap_region()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (14 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 15/20] mm: Change failure of MAP_FIXED to restoring the gap on failure Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 17/20] mm/mmap: Use vms accounted pages " Liam R. Howlett
` (3 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Instead of shifting the length by PAGE_SHIFT, use PHYS_PFN. Also use the
existing local variable everywhere instead of only some of the time.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/mmap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index c1b3d17f97be..19dac138f913 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1364,7 +1364,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
struct vm_area_struct *next, *prev, *merge;
- pgoff_t pglen = len >> PAGE_SHIFT;
+ pgoff_t pglen = PHYS_PFN(len);
unsigned long charged = 0;
struct vma_munmap_struct vms;
struct ma_state mas_detach;
@@ -1384,7 +1384,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* MAP_FIXED may remove pages of mappings that intersects with requested
* mapping. Account for the pages it would unmap.
*/
- if (!may_expand_vm(mm, vm_flags, (len >> PAGE_SHIFT) - nr_pages))
+ if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
return -ENOMEM;
/* Find the first overlapping VMA */
@@ -1412,7 +1412,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* Private writable mapping: check memory availability
*/
if (accountable_mapping(file, vm_flags)) {
- charged = len >> PAGE_SHIFT;
+ charged = pglen;
charged -= nr_accounted;
if (security_vm_enough_memory_mm(mm, charged))
goto abort_munmap;
@@ -1573,14 +1573,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Unmap any existing mapping in the area */
vms_complete_munmap_vmas(&vms, &mas_detach);
- vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
+ vm_stat_account(mm, vm_flags, pglen);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
vma == get_gate_vma(current->mm))
vm_flags_clear(vma, VM_LOCKED_MASK);
else
- mm->locked_vm += (len >> PAGE_SHIFT);
+ mm->locked_vm += pglen;
}
if (file)
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread* [PATCH v6 17/20] mm/mmap: Use vms accounted pages in mmap_region()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (15 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 16/20] mm/mmap: Use PHYS_PFN in mmap_region() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-21 16:35 ` Paul Moore
2024-08-20 23:57 ` [PATCH v6 18/20] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
` (2 subsequent siblings)
19 siblings, 1 reply; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett,
linux-security-module
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Change from nr_pages variable to vms.nr_accounted for the charged pages
calculation. This is necessary for a future patch.
This also avoids checking security_vm_enough_memory_mm() if the amount
of memory won't change.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Kees Cook <kees@kernel.org>
Cc: linux-security-module@vger.kernel.org
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
---
mm/mmap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 19dac138f913..2a4f1df96f94 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1413,9 +1413,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*/
if (accountable_mapping(file, vm_flags)) {
charged = pglen;
- charged -= nr_accounted;
- if (security_vm_enough_memory_mm(mm, charged))
+ charged -= vms.nr_accounted;
+ if (charged && security_vm_enough_memory_mm(mm, charged))
goto abort_munmap;
+
vms.nr_accounted = 0;
vm_flags |= VM_ACCOUNT;
}
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: [PATCH v6 17/20] mm/mmap: Use vms accounted pages in mmap_region()
2024-08-20 23:57 ` [PATCH v6 17/20] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-08-21 16:35 ` Paul Moore
2024-08-21 17:15 ` Liam R. Howlett
0 siblings, 1 reply; 32+ messages in thread
From: Paul Moore @ 2024-08-21 16:35 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Lorenzo Stoakes, Matthew Wilcox, Vlastimil Babka,
sidhartha.kumar, Bert Karwatzki, Jiri Olsa, Kees Cook,
Paul E . McKenney, linux-security-module
On Tue, Aug 20, 2024 at 8:02 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> Change from nr_pages variable to vms.nr_accounted for the charged pages
> calculation. This is necessary for a future patch.
>
> This also avoids checking security_vm_enough_memory_mm() if the amount
> of memory won't change.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Cc: Kees Cook <kees@kernel.org>
> Cc: linux-security-module@vger.kernel.org
> Reviewed-by: Kees Cook <kees@kernel.org>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> ---
> mm/mmap.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
I'm pretty sure I already ACK'd this, but I don't see it above so here
it is again:
Acked-by: Paul Moore <paul@paul-moore.com> (LSM)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 19dac138f913..2a4f1df96f94 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1413,9 +1413,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> */
> if (accountable_mapping(file, vm_flags)) {
> charged = pglen;
> - charged -= nr_accounted;
> - if (security_vm_enough_memory_mm(mm, charged))
> + charged -= vms.nr_accounted;
> + if (charged && security_vm_enough_memory_mm(mm, charged))
> goto abort_munmap;
> +
> vms.nr_accounted = 0;
> vm_flags |= VM_ACCOUNT;
> }
> --
> 2.43.0
--
paul-moore.com
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: [PATCH v6 17/20] mm/mmap: Use vms accounted pages in mmap_region()
2024-08-21 16:35 ` Paul Moore
@ 2024-08-21 17:15 ` Liam R. Howlett
0 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-21 17:15 UTC (permalink / raw)
To: Paul Moore
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Lorenzo Stoakes, Matthew Wilcox, Vlastimil Babka,
sidhartha.kumar, Bert Karwatzki, Jiri Olsa, Kees Cook,
Paul E . McKenney, linux-security-module
* Paul Moore <paul@paul-moore.com> [240821 12:35]:
> On Tue, Aug 20, 2024 at 8:02 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> >
> > Change from nr_pages variable to vms.nr_accounted for the charged pages
> > calculation. This is necessary for a future patch.
> >
> > This also avoids checking security_vm_enough_memory_mm() if the amount
> > of memory won't change.
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: linux-security-module@vger.kernel.org
> > Reviewed-by: Kees Cook <kees@kernel.org>
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > mm/mmap.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
>
> I'm pretty sure I already ACK'd this, but I don't see it above so here
> it is again:
>
> Acked-by: Paul Moore <paul@paul-moore.com> (LSM)
Sorry for missing that. It's here now for sure.
Thanks,
Liam
>
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 19dac138f913..2a4f1df96f94 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1413,9 +1413,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > */
> > if (accountable_mapping(file, vm_flags)) {
> > charged = pglen;
> > - charged -= nr_accounted;
> > - if (security_vm_enough_memory_mm(mm, charged))
> > + charged -= vms.nr_accounted;
> > + if (charged && security_vm_enough_memory_mm(mm, charged))
> > goto abort_munmap;
> > +
> > vms.nr_accounted = 0;
> > vm_flags |= VM_ACCOUNT;
> > }
> > --
> > 2.43.0
>
> --
> paul-moore.com
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v6 18/20] ipc/shm, mm: Drop do_vma_munmap()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (16 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 17/20] mm/mmap: Use vms accounted pages " Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-21 11:59 ` Lorenzo Stoakes
2024-08-20 23:57 ` [PATCH v6 19/20] mm: Move may_expand_vm() check in mmap_region() Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 20/20] mm/vma: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
19 siblings, 1 reply; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The do_vma_munmap() wrapper existed for callers that didn't have a vma
iterator and needed to check the vma mseal status prior to calling the
underlying munmap(). All callers now use a vma iterator and since the
mseal check has been moved to do_vmi_align_munmap() and the vmas are
aligned, this function can just be called instead.
do_vmi_align_munmap() can no longer be static as ipc/shm is using it and
it is exported via the mm.h header.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
include/linux/mm.h | 6 +++---
ipc/shm.c | 8 ++++----
mm/mmap.c | 29 ++++-------------------------
3 files changed, 11 insertions(+), 32 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b1eed30fdc06..c5a83d9d1110 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3292,14 +3292,14 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
+extern int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ struct mm_struct *mm, unsigned long start,
+ unsigned long end, struct list_head *uf, bool unlock);
extern int do_munmap(struct mm_struct *, unsigned long, size_t,
struct list_head *uf);
extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
#ifdef CONFIG_MMU
-extern int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct list_head *uf, bool unlock);
extern int __mm_populate(unsigned long addr, unsigned long len,
int ignore_errors);
static inline void mm_populate(unsigned long addr, unsigned long len)
diff --git a/ipc/shm.c b/ipc/shm.c
index 3e3071252dac..99564c870084 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1778,8 +1778,8 @@ long ksys_shmdt(char __user *shmaddr)
*/
file = vma->vm_file;
size = i_size_read(file_inode(vma->vm_file));
- do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
- NULL, false);
+ do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
+ vma->vm_end, NULL, false);
/*
* We discovered the size of the shm segment, so
* break out of here and fall through to the next
@@ -1803,8 +1803,8 @@ long ksys_shmdt(char __user *shmaddr)
if ((vma->vm_ops == &shm_vm_ops) &&
((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
(vma->vm_file == file)) {
- do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
- NULL, false);
+ do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
+ vma->vm_end, NULL, false);
}
vma = vma_next(&vmi);
diff --git a/mm/mmap.c b/mm/mmap.c
index 2a4f1df96f94..49d9e95f42f5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -169,11 +169,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
goto out; /* mapping intersects with an existing non-brk vma. */
/*
* mm->brk must be protected by write mmap_lock.
- * do_vma_munmap() will drop the lock on success, so update it
- * before calling do_vma_munmap().
+ * do_vmi_align_munmap() will drop the lock on success, so
+ * update it before calling do_vma_munmap().
*/
mm->brk = brk;
- if (do_vma_munmap(&vmi, brkvma, newbrk, oldbrk, &uf, true))
+ if (do_vmi_align_munmap(&vmi, brkvma, mm, newbrk, oldbrk, &uf,
+ /* unlock = */ true))
goto out;
goto success_unlocked;
@@ -1742,28 +1743,6 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
return ret;
}
-/*
- * do_vma_munmap() - Unmap a full or partial vma.
- * @vmi: The vma iterator pointing at the vma
- * @vma: The first vma to be munmapped
- * @start: the start of the address to unmap
- * @end: The end of the address to unmap
- * @uf: The userfaultfd list_head
- * @unlock: Drop the lock on success
- *
- * unmaps a VMA mapping when the vma iterator is already in position.
- * Does not handle alignment.
- *
- * Return: 0 on success drops the lock of so directed, error on failure and will
- * still hold the lock.
- */
-int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long start, unsigned long end, struct list_head *uf,
- bool unlock)
-{
- return do_vmi_align_munmap(vmi, vma, vma->vm_mm, start, end, uf, unlock);
-}
-
/*
* do_brk_flags() - Increase the brk vma if the flags match.
* @vmi: The vma iterator
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread* Re: [PATCH v6 18/20] ipc/shm, mm: Drop do_vma_munmap()
2024-08-20 23:57 ` [PATCH v6 18/20] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
@ 2024-08-21 11:59 ` Lorenzo Stoakes
0 siblings, 0 replies; 32+ messages in thread
From: Lorenzo Stoakes @ 2024-08-21 11:59 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Andrew Morton, linux-mm, linux-kernel, Suren Baghdasaryan,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney
On Tue, Aug 20, 2024 at 07:57:27PM GMT, Liam R. Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
>
> The do_vma_munmap() wrapper existed for callers that didn't have a vma
> iterator and needed to check the vma mseal status prior to calling the
> underlying munmap(). All callers now use a vma iterator and since the
> mseal check has been moved to do_vmi_align_munmap() and the vmas are
> aligned, this function can just be called instead.
>
> do_vmi_align_munmap() can no longer be static as ipc/shm is using it and
> it is exported via the mm.h header.
>
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
1 little nit, but generally:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/mm.h | 6 +++---
> ipc/shm.c | 8 ++++----
> mm/mmap.c | 29 ++++-------------------------
> 3 files changed, 11 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b1eed30fdc06..c5a83d9d1110 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3292,14 +3292,14 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
> extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
> unsigned long start, size_t len, struct list_head *uf,
> bool unlock);
> +extern int do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> + struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct list_head *uf, bool unlock);
Nit, as per the standard Vlasta one about removing 'extern' from these
declarations as we go ;)
> extern int do_munmap(struct mm_struct *, unsigned long, size_t,
> struct list_head *uf);
> extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
>
> #ifdef CONFIG_MMU
> -extern int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> - unsigned long start, unsigned long end,
> - struct list_head *uf, bool unlock);
> extern int __mm_populate(unsigned long addr, unsigned long len,
> int ignore_errors);
> static inline void mm_populate(unsigned long addr, unsigned long len)
> diff --git a/ipc/shm.c b/ipc/shm.c
> index 3e3071252dac..99564c870084 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -1778,8 +1778,8 @@ long ksys_shmdt(char __user *shmaddr)
> */
> file = vma->vm_file;
> size = i_size_read(file_inode(vma->vm_file));
> - do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
> - NULL, false);
> + do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
> + vma->vm_end, NULL, false);
> /*
> * We discovered the size of the shm segment, so
> * break out of here and fall through to the next
> @@ -1803,8 +1803,8 @@ long ksys_shmdt(char __user *shmaddr)
> if ((vma->vm_ops == &shm_vm_ops) &&
> ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
> (vma->vm_file == file)) {
> - do_vma_munmap(&vmi, vma, vma->vm_start, vma->vm_end,
> - NULL, false);
> + do_vmi_align_munmap(&vmi, vma, mm, vma->vm_start,
> + vma->vm_end, NULL, false);
> }
>
> vma = vma_next(&vmi);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 2a4f1df96f94..49d9e95f42f5 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -169,11 +169,12 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
> goto out; /* mapping intersects with an existing non-brk vma. */
> /*
> * mm->brk must be protected by write mmap_lock.
> - * do_vma_munmap() will drop the lock on success, so update it
> - * before calling do_vma_munmap().
> + * do_vmi_align_munmap() will drop the lock on success, so
> + * update it before calling do_vma_munmap().
> */
> mm->brk = brk;
> - if (do_vma_munmap(&vmi, brkvma, newbrk, oldbrk, &uf, true))
> + if (do_vmi_align_munmap(&vmi, brkvma, mm, newbrk, oldbrk, &uf,
> + /* unlock = */ true))
> goto out;
>
> goto success_unlocked;
> @@ -1742,28 +1743,6 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
> return ret;
> }
>
> -/*
> - * do_vma_munmap() - Unmap a full or partial vma.
> - * @vmi: The vma iterator pointing at the vma
> - * @vma: The first vma to be munmapped
> - * @start: the start of the address to unmap
> - * @end: The end of the address to unmap
> - * @uf: The userfaultfd list_head
> - * @unlock: Drop the lock on success
> - *
> - * unmaps a VMA mapping when the vma iterator is already in position.
> - * Does not handle alignment.
> - *
> - * Return: 0 on success drops the lock of so directed, error on failure and will
> - * still hold the lock.
> - */
> -int do_vma_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> - unsigned long start, unsigned long end, struct list_head *uf,
> - bool unlock)
> -{
> - return do_vmi_align_munmap(vmi, vma, vma->vm_mm, start, end, uf, unlock);
> -}
> -
> /*
> * do_brk_flags() - Increase the brk vma if the flags match.
> * @vmi: The vma iterator
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v6 19/20] mm: Move may_expand_vm() check in mmap_region()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (17 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 18/20] ipc/shm, mm: Drop do_vma_munmap() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
2024-08-20 23:57 ` [PATCH v6 20/20] mm/vma: Drop incorrect comment from vms_gather_munmap_vmas() Liam R. Howlett
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The may_expand_vm() check requires the count of the pages within the
munmap range.  Since this count is needed for accounting and is obtained
later anyway, reordering the may_expand_vm() check to later in the call
stack, after the vma munmap struct (vms) is initialised and the gather
stage has potentially run, allows for a single loop over the vmas.  The
gather stage does not commit any work, so everything can be undone in
the case of a failure.
The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
call, so use it instead of looping over the vmas twice.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 15 ++++-----------
mm/vma.c | 21 ---------------------
mm/vma.h | 3 ---
3 files changed, 4 insertions(+), 35 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 49d9e95f42f5..012b3495c266 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1376,17 +1376,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
pgoff_t vm_pgoff;
int error = -ENOMEM;
VMA_ITERATOR(vmi, mm, addr);
- unsigned long nr_pages, nr_accounted;
-
- nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
-
- /*
- * Check against address space limit.
- * MAP_FIXED may remove pages of mappings that intersects with requested
- * mapping. Account for the pages it would unmap.
- */
- if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
- return -ENOMEM;
/* Find the first overlapping VMA */
vma = vma_find(&vmi, end);
@@ -1409,6 +1398,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_iter_next_range(&vmi);
}
+ /* Check against address space limit. */
+ if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
+ goto abort_munmap;
+
/*
* Private writable mapping: check memory availability
*/
diff --git a/mm/vma.c b/mm/vma.c
index 5b33f7460ab7..f277ab1b0175 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1643,27 +1643,6 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot)
return vma_fs_can_writeback(vma);
}
-unsigned long count_vma_pages_range(struct mm_struct *mm,
- unsigned long addr, unsigned long end,
- unsigned long *nr_accounted)
-{
- VMA_ITERATOR(vmi, mm, addr);
- struct vm_area_struct *vma;
- unsigned long nr_pages = 0;
-
- *nr_accounted = 0;
- for_each_vma_range(vmi, vma, end) {
- unsigned long vm_start = max(addr, vma->vm_start);
- unsigned long vm_end = min(end, vma->vm_end);
-
- nr_pages += PHYS_PFN(vm_end - vm_start);
- if (vma->vm_flags & VM_ACCOUNT)
- *nr_accounted += PHYS_PFN(vm_end - vm_start);
- }
-
- return nr_pages;
-}
-
static DEFINE_MUTEX(mm_all_locks_mutex);
static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
diff --git a/mm/vma.h b/mm/vma.h
index 7618ddbfd2b2..f8b4d3375a5b 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -305,9 +305,6 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
int mm_take_all_locks(struct mm_struct *mm);
void mm_drop_all_locks(struct mm_struct *mm);
-unsigned long count_vma_pages_range(struct mm_struct *mm,
- unsigned long addr, unsigned long end,
- unsigned long *nr_accounted);
static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
{
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread* [PATCH v6 20/20] mm/vma: Drop incorrect comment from vms_gather_munmap_vmas()
2024-08-20 23:57 [PATCH v6 00/20] Avoid MAP_FIXED gap exposure Liam R. Howlett
` (18 preceding siblings ...)
2024-08-20 23:57 ` [PATCH v6 19/20] mm: Move may_expand_vm() check in mmap_region() Liam R. Howlett
@ 2024-08-20 23:57 ` Liam R. Howlett
19 siblings, 0 replies; 32+ messages in thread
From: Liam R. Howlett @ 2024-08-20 23:57 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Suren Baghdasaryan, Lorenzo Stoakes,
Matthew Wilcox, Vlastimil Babka, sidhartha.kumar, Bert Karwatzki,
Jiri Olsa, Kees Cook, Paul E . McKenney, Liam R. Howlett
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
The comment has been outdated since commit 6b73cff239e52 ("mm: change
munmap splitting order and move_vma()").  move_vma() was altered to fix
the fragile state of the accounting since then.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index f277ab1b0175..dd064b0abe17 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -750,13 +750,8 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
/*
* If we need to split any vma, do it now to save pain later.
- *
- * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
- * unmapped vm_area_struct will remain in use: so lower split_vma
- * places tmp vma above, and higher split_vma places tmp vma below.
+ * Does it split the first one?
*/
-
- /* Does it split the first one? */
if (vms->start > vms->vma->vm_start) {
/*
--
2.43.0
^ permalink raw reply [flat|nested] 32+ messages in thread