* [PATCH 0/5] mm/vma: make more mmap logic userland testable
@ 2024-12-03 18:05 Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c Lorenzo Stoakes
` (5 more replies)
0 siblings, 6 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-03 18:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
This series carries on the work started in previous series and continued in
commit 52956b0d7fb9 ("mm: isolate mmap internal logic to mm/vma.c"), moving
the remainder of the memory mapping implementation logic into mm/vma.c and
allowing the bulk of the mapping logic to be unit tested.
It is highly useful to do so, as this means we can both fundamentally test
this core logic and introduce regression tests to ensure that previously
resolved issues do not recur.
Vitally, this includes the do_brk_flags() function, meaning both core means
by which userland maps memory are now testable.
Performance testing was performed after this change given the brk() system
call's sensitivity to change, and no performance regression was observed.
The stack expansion logic is also moved into mm/vma.c, which necessitates a
change in the API exposed to the exec code, removing the invocation of the
expand_downwards() function used in get_arg_page() and instead adding
mmap_read_lock_maybe_expand() to wrap this.
Lorenzo Stoakes (5):
mm/vma: move brk() internals to mm/vma.c
mm/vma: move unmapped_area() internals to mm/vma.c
mm: abstract get_arg_page() stack expansion and mmap read lock
mm/vma: move stack expansion logic to mm/vma.c
mm/vma: move __vm_munmap() to mm/vma.c
fs/exec.c | 14 +-
include/linux/mm.h | 5 +-
mm/mmap.c | 469 ++++--------------------------
mm/vma.c | 478 ++++++++++++++++++++++++++++---
mm/vma.h | 20 +-
tools/testing/vma/vma.c | 11 +
tools/testing/vma/vma_internal.h | 152 ++++++++++
7 files changed, 681 insertions(+), 468 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c
2024-12-03 18:05 [PATCH 0/5] mm/vma: make more mmap logic userland testable Lorenzo Stoakes
@ 2024-12-03 18:05 ` Lorenzo Stoakes
2024-12-04 11:55 ` kernel test robot
` (2 more replies)
2024-12-03 18:05 ` [PATCH 2/5] mm/vma: move unmapped_area() " Lorenzo Stoakes
` (4 subsequent siblings)
5 siblings, 3 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-03 18:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
Now that we have moved the mmap_region() internals to mm/vma.c, making them
available to userland testing, it makes sense to do the same with brk().
This continues the pattern of VMA heavy lifting being done in mm/vma.c in
an environment where it can be subject to straightforward unit and
regression testing, with other VMA-adjacent files becoming wrappers around
this functionality.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 85 +-------------------------------
mm/vma.c | 82 ++++++++++++++++++++++++++++++
mm/vma.h | 3 ++
tools/testing/vma/vma_internal.h | 22 +++++++++
4 files changed, 108 insertions(+), 84 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 16f8e8be01f8..93188ef46dae 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -111,8 +111,7 @@ static int check_brk_limits(unsigned long addr, unsigned long len)
return mlock_future_ok(current->mm, current->mm->def_flags, len)
? 0 : -EAGAIN;
}
-static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
- unsigned long addr, unsigned long request, unsigned long flags);
+
SYSCALL_DEFINE1(brk, unsigned long, brk)
{
unsigned long newbrk, oldbrk, origbrk;
@@ -1512,88 +1511,6 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
return ret;
}
-/*
- * do_brk_flags() - Increase the brk vma if the flags match.
- * @vmi: The vma iterator
- * @addr: The start address
- * @len: The length of the increase
- * @vma: The vma,
- * @flags: The VMA Flags
- *
- * Extend the brk VMA from addr to addr + len. If the VMA is NULL or the flags
- * do not match then create a new anonymous VMA. Eventually we may be able to
- * do some brk-specific accounting here.
- */
-static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long addr, unsigned long len, unsigned long flags)
-{
- struct mm_struct *mm = current->mm;
-
- /*
- * Check against address space limits by the changed size
- * Note: This happens *after* clearing old mappings in some code paths.
- */
- flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
- if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
- return -ENOMEM;
-
- if (mm->map_count > sysctl_max_map_count)
- return -ENOMEM;
-
- if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
- return -ENOMEM;
-
- /*
- * Expand the existing vma if possible; Note that singular lists do not
- * occur after forking, so the expand will only happen on new VMAs.
- */
- if (vma && vma->vm_end == addr) {
- VMG_STATE(vmg, mm, vmi, addr, addr + len, flags, PHYS_PFN(addr));
-
- vmg.prev = vma;
- /* vmi is positioned at prev, which this mode expects. */
- vmg.merge_flags = VMG_FLAG_JUST_EXPAND;
-
- if (vma_merge_new_range(&vmg))
- goto out;
- else if (vmg_nomem(&vmg))
- goto unacct_fail;
- }
-
- if (vma)
- vma_iter_next_range(vmi);
- /* create a vma struct for an anonymous mapping */
- vma = vm_area_alloc(mm);
- if (!vma)
- goto unacct_fail;
-
- vma_set_anonymous(vma);
- vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
- vm_flags_init(vma, flags);
- vma->vm_page_prot = vm_get_page_prot(flags);
- vma_start_write(vma);
- if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
- goto mas_store_fail;
-
- mm->map_count++;
- validate_mm(mm);
- ksm_add_vma(vma);
-out:
- perf_event_mmap(vma);
- mm->total_vm += len >> PAGE_SHIFT;
- mm->data_vm += len >> PAGE_SHIFT;
- if (flags & VM_LOCKED)
- mm->locked_vm += (len >> PAGE_SHIFT);
- vm_flags_set(vma, VM_SOFTDIRTY);
- return 0;
-
-mas_store_fail:
- vm_area_free(vma);
-unacct_fail:
- vm_unacct_memory(len >> PAGE_SHIFT);
- return -ENOMEM;
-}
-
int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
{
struct mm_struct *mm = current->mm;
diff --git a/mm/vma.c b/mm/vma.c
index 8e31b7e25aeb..9955b5332ca2 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2478,3 +2478,85 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
return error;
}
+
+/*
+ * do_brk_flags() - Increase the brk vma if the flags match.
+ * @vmi: The vma iterator
+ * @vma: The vma
+ * @addr: The start address
+ * @len: The length of the increase
+ * @flags: The VMA flags
+ *
+ * Extend the brk VMA from addr to addr + len. If the VMA is NULL or the flags
+ * do not match then create a new anonymous VMA. Eventually we may be able to
+ * do some brk-specific accounting here.
+ */
+int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long addr, unsigned long len, unsigned long flags)
+{
+ struct mm_struct *mm = current->mm;
+
+ /*
+ * Check against address space limits by the changed size
+ * Note: This happens *after* clearing old mappings in some code paths.
+ */
+ flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
+ if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
+ return -ENOMEM;
+
+ if (mm->map_count > sysctl_max_map_count)
+ return -ENOMEM;
+
+ if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
+ return -ENOMEM;
+
+ /*
+ * Expand the existing vma if possible; Note that singular lists do not
+ * occur after forking, so the expand will only happen on new VMAs.
+ */
+ if (vma && vma->vm_end == addr) {
+ VMG_STATE(vmg, mm, vmi, addr, addr + len, flags, PHYS_PFN(addr));
+
+ vmg.prev = vma;
+ /* vmi is positioned at prev, which this mode expects. */
+ vmg.merge_flags = VMG_FLAG_JUST_EXPAND;
+
+ if (vma_merge_new_range(&vmg))
+ goto out;
+ else if (vmg_nomem(&vmg))
+ goto unacct_fail;
+ }
+
+ if (vma)
+ vma_iter_next_range(vmi);
+ /* create a vma struct for an anonymous mapping */
+ vma = vm_area_alloc(mm);
+ if (!vma)
+ goto unacct_fail;
+
+ vma_set_anonymous(vma);
+ vma_set_range(vma, addr, addr + len, addr >> PAGE_SHIFT);
+ vm_flags_init(vma, flags);
+ vma->vm_page_prot = vm_get_page_prot(flags);
+ vma_start_write(vma);
+ if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
+ goto mas_store_fail;
+
+ mm->map_count++;
+ validate_mm(mm);
+ ksm_add_vma(vma);
+out:
+ perf_event_mmap(vma);
+ mm->total_vm += len >> PAGE_SHIFT;
+ mm->data_vm += len >> PAGE_SHIFT;
+ if (flags & VM_LOCKED)
+ mm->locked_vm += (len >> PAGE_SHIFT);
+ vm_flags_set(vma, VM_SOFTDIRTY);
+ return 0;
+
+mas_store_fail:
+ vm_area_free(vma);
+unacct_fail:
+ vm_unacct_memory(len >> PAGE_SHIFT);
+ return -ENOMEM;
+}
diff --git a/mm/vma.h b/mm/vma.h
index 388d34748674..83a15d3a8285 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -247,6 +247,9 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf);
+int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
+ unsigned long addr, unsigned long request, unsigned long flags);
+
static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
{
/*
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index e76ff579e1fd..7c3c15135c5b 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -39,6 +39,7 @@
#define VM_SHARED 0x00000008
#define VM_MAYREAD 0x00000010
#define VM_MAYWRITE 0x00000020
+#define VM_MAYEXEC 0x00000040
#define VM_GROWSDOWN 0x00000100
#define VM_PFNMAP 0x00000400
#define VM_LOCKED 0x00002000
@@ -58,6 +59,13 @@
/* This mask represents all the VMA flag bits used by mlock */
#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
+#define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
+
+#define VM_DATA_FLAGS_TSK_EXEC (VM_READ | VM_WRITE | TASK_EXEC | \
+ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
+
#ifdef CONFIG_64BIT
/* VM is sealed, in vm_flags */
#define VM_SEALED _BITUL(63)
@@ -122,10 +130,22 @@ enum {
TASK_COMM_LEN = 16,
};
+/*
+ * Flags for bug emulation.
+ *
+ * These occupy the top three bytes.
+ */
+enum {
+ READ_IMPLIES_EXEC = 0x0400000,
+};
+
struct task_struct {
char comm[TASK_COMM_LEN];
pid_t pid;
struct mm_struct *mm;
+
+ /* Used for emulating ABI behavior of previous Linux versions: */
+ unsigned int personality;
};
struct task_struct *get_current(void);
@@ -186,6 +206,8 @@ struct mm_struct {
unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
unsigned long stack_vm; /* VM_STACK */
+
+ unsigned long def_flags;
};
struct vma_lock {
--
2.47.1
* [PATCH 2/5] mm/vma: move unmapped_area() internals to mm/vma.c
2024-12-03 18:05 [PATCH 0/5] mm/vma: make more mmap logic userland testable Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c Lorenzo Stoakes
@ 2024-12-03 18:05 ` Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock Lorenzo Stoakes
` (3 subsequent siblings)
5 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-03 18:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
We want to be able to unit test the unmapped area logic, so move it to
mm/vma.c. The wrappers which invoke this remain in place in mm/mmap.c.
In addition, naturally, update the existing test code to enable it to be
compiled and run in userland.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 109 -------------------------------
mm/vma.c | 109 +++++++++++++++++++++++++++++++
mm/vma.h | 3 +
tools/testing/vma/vma.c | 6 ++
tools/testing/vma/vma_internal.h | 59 +++++++++++++++++
5 files changed, 177 insertions(+), 109 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 93188ef46dae..f053de1d6fae 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -580,115 +580,6 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
}
#endif /* __ARCH_WANT_SYS_OLD_MMAP */
-/**
- * unmapped_area() - Find an area between the low_limit and the high_limit with
- * the correct alignment and offset, all from @info. Note: current->mm is used
- * for the search.
- *
- * @info: The unmapped area information including the range [low_limit -
- * high_limit), the alignment offset and mask.
- *
- * Return: A memory address or -ENOMEM.
- */
-static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
-{
- unsigned long length, gap;
- unsigned long low_limit, high_limit;
- struct vm_area_struct *tmp;
- VMA_ITERATOR(vmi, current->mm, 0);
-
- /* Adjust search length to account for worst case alignment overhead */
- length = info->length + info->align_mask + info->start_gap;
- if (length < info->length)
- return -ENOMEM;
-
- low_limit = info->low_limit;
- if (low_limit < mmap_min_addr)
- low_limit = mmap_min_addr;
- high_limit = info->high_limit;
-retry:
- if (vma_iter_area_lowest(&vmi, low_limit, high_limit, length))
- return -ENOMEM;
-
- /*
- * Adjust for the gap first so it doesn't interfere with the
- * later alignment. The first step is the minimum needed to
- * fulill the start gap, the next steps is the minimum to align
- * that. It is the minimum needed to fulill both.
- */
- gap = vma_iter_addr(&vmi) + info->start_gap;
- gap += (info->align_offset - gap) & info->align_mask;
- tmp = vma_next(&vmi);
- if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
- if (vm_start_gap(tmp) < gap + length - 1) {
- low_limit = tmp->vm_end;
- vma_iter_reset(&vmi);
- goto retry;
- }
- } else {
- tmp = vma_prev(&vmi);
- if (tmp && vm_end_gap(tmp) > gap) {
- low_limit = vm_end_gap(tmp);
- vma_iter_reset(&vmi);
- goto retry;
- }
- }
-
- return gap;
-}
-
-/**
- * unmapped_area_topdown() - Find an area between the low_limit and the
- * high_limit with the correct alignment and offset at the highest available
- * address, all from @info. Note: current->mm is used for the search.
- *
- * @info: The unmapped area information including the range [low_limit -
- * high_limit), the alignment offset and mask.
- *
- * Return: A memory address or -ENOMEM.
- */
-static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
-{
- unsigned long length, gap, gap_end;
- unsigned long low_limit, high_limit;
- struct vm_area_struct *tmp;
- VMA_ITERATOR(vmi, current->mm, 0);
-
- /* Adjust search length to account for worst case alignment overhead */
- length = info->length + info->align_mask + info->start_gap;
- if (length < info->length)
- return -ENOMEM;
-
- low_limit = info->low_limit;
- if (low_limit < mmap_min_addr)
- low_limit = mmap_min_addr;
- high_limit = info->high_limit;
-retry:
- if (vma_iter_area_highest(&vmi, low_limit, high_limit, length))
- return -ENOMEM;
-
- gap = vma_iter_end(&vmi) - info->length;
- gap -= (gap - info->align_offset) & info->align_mask;
- gap_end = vma_iter_end(&vmi);
- tmp = vma_next(&vmi);
- if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
- if (vm_start_gap(tmp) < gap_end) {
- high_limit = vm_start_gap(tmp);
- vma_iter_reset(&vmi);
- goto retry;
- }
- } else {
- tmp = vma_prev(&vmi);
- if (tmp && vm_end_gap(tmp) > gap) {
- high_limit = tmp->vm_start;
- vma_iter_reset(&vmi);
- goto retry;
- }
- }
-
- return gap;
-}
-
/*
* Determine if the allocation needs to ensure that there is no
* existing mapping within it's guard gaps, for use as start_gap.
diff --git a/mm/vma.c b/mm/vma.c
index 9955b5332ca2..50c0c9c443d2 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2560,3 +2560,112 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
vm_unacct_memory(len >> PAGE_SHIFT);
return -ENOMEM;
}
+
+/**
+ * unmapped_area() - Find an area between the low_limit and the high_limit with
+ * the correct alignment and offset, all from @info. Note: current->mm is used
+ * for the search.
+ *
+ * @info: The unmapped area information including the range [low_limit -
+ * high_limit), the alignment offset and mask.
+ *
+ * Return: A memory address or -ENOMEM.
+ */
+unsigned long unmapped_area(struct vm_unmapped_area_info *info)
+{
+ unsigned long length, gap;
+ unsigned long low_limit, high_limit;
+ struct vm_area_struct *tmp;
+ VMA_ITERATOR(vmi, current->mm, 0);
+
+ /* Adjust search length to account for worst case alignment overhead */
+ length = info->length + info->align_mask + info->start_gap;
+ if (length < info->length)
+ return -ENOMEM;
+
+ low_limit = info->low_limit;
+ if (low_limit < mmap_min_addr)
+ low_limit = mmap_min_addr;
+ high_limit = info->high_limit;
+retry:
+ if (vma_iter_area_lowest(&vmi, low_limit, high_limit, length))
+ return -ENOMEM;
+
+ /*
+ * Adjust for the gap first so it doesn't interfere with the
+ * later alignment. The first step is the minimum needed to
+ * fulfill the start gap, the next step is the minimum to align
+ * that. It is the minimum needed to fulfill both.
+ */
+ gap = vma_iter_addr(&vmi) + info->start_gap;
+ gap += (info->align_offset - gap) & info->align_mask;
+ tmp = vma_next(&vmi);
+ if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
+ if (vm_start_gap(tmp) < gap + length - 1) {
+ low_limit = tmp->vm_end;
+ vma_iter_reset(&vmi);
+ goto retry;
+ }
+ } else {
+ tmp = vma_prev(&vmi);
+ if (tmp && vm_end_gap(tmp) > gap) {
+ low_limit = vm_end_gap(tmp);
+ vma_iter_reset(&vmi);
+ goto retry;
+ }
+ }
+
+ return gap;
+}
+
+/**
+ * unmapped_area_topdown() - Find an area between the low_limit and the
+ * high_limit with the correct alignment and offset at the highest available
+ * address, all from @info. Note: current->mm is used for the search.
+ *
+ * @info: The unmapped area information including the range [low_limit -
+ * high_limit), the alignment offset and mask.
+ *
+ * Return: A memory address or -ENOMEM.
+ */
+unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
+{
+ unsigned long length, gap, gap_end;
+ unsigned long low_limit, high_limit;
+ struct vm_area_struct *tmp;
+ VMA_ITERATOR(vmi, current->mm, 0);
+
+ /* Adjust search length to account for worst case alignment overhead */
+ length = info->length + info->align_mask + info->start_gap;
+ if (length < info->length)
+ return -ENOMEM;
+
+ low_limit = info->low_limit;
+ if (low_limit < mmap_min_addr)
+ low_limit = mmap_min_addr;
+ high_limit = info->high_limit;
+retry:
+ if (vma_iter_area_highest(&vmi, low_limit, high_limit, length))
+ return -ENOMEM;
+
+ gap = vma_iter_end(&vmi) - info->length;
+ gap -= (gap - info->align_offset) & info->align_mask;
+ gap_end = vma_iter_end(&vmi);
+ tmp = vma_next(&vmi);
+ if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
+ if (vm_start_gap(tmp) < gap_end) {
+ high_limit = vm_start_gap(tmp);
+ vma_iter_reset(&vmi);
+ goto retry;
+ }
+ } else {
+ tmp = vma_prev(&vmi);
+ if (tmp && vm_end_gap(tmp) > gap) {
+ high_limit = tmp->vm_start;
+ vma_iter_reset(&vmi);
+ goto retry;
+ }
+ }
+
+ return gap;
+}
diff --git a/mm/vma.h b/mm/vma.h
index 83a15d3a8285..c60f37d89eb1 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -250,6 +250,9 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
unsigned long addr, unsigned long request, unsigned long flags);
+unsigned long unmapped_area(struct vm_unmapped_area_info *info);
+unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
+
static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
{
/*
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 8fab5e13c7c3..39ee61e55634 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -18,6 +18,12 @@ static bool fail_prealloc;
#define vma_iter_prealloc(vmi, vma) \
(fail_prealloc ? -ENOMEM : mas_preallocate(&(vmi)->mas, (vma), GFP_KERNEL))
+#define CONFIG_DEFAULT_MMAP_MIN_ADDR 65536
+
+unsigned long mmap_min_addr = CONFIG_DEFAULT_MMAP_MIN_ADDR;
+unsigned long dac_mmap_min_addr = CONFIG_DEFAULT_MMAP_MIN_ADDR;
+unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT;
+
/*
* Directly import the VMA implementation here. Our vma_internal.h wrapper
* provides userland-equivalent functionality for everything vma.c uses.
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 7c3c15135c5b..6ad8bd8edaad 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -27,6 +27,15 @@
#include <linux/rbtree.h>
#include <linux/rwsem.h>
+extern unsigned long stack_guard_gap;
+#ifdef CONFIG_MMU
+extern unsigned long mmap_min_addr;
+extern unsigned long dac_mmap_min_addr;
+#else
+#define mmap_min_addr 0UL
+#define dac_mmap_min_addr 0UL
+#endif
+
#define VM_WARN_ON(_expr) (WARN_ON(_expr))
#define VM_WARN_ON_ONCE(_expr) (WARN_ON_ONCE(_expr))
#define VM_BUG_ON(_expr) (BUG_ON(_expr))
@@ -52,6 +61,8 @@
#define VM_STACK VM_GROWSDOWN
#define VM_SHADOW_STACK VM_NONE
#define VM_SOFTDIRTY 0
+#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
+#define VM_GROWSUP VM_NONE
#define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)
@@ -66,6 +77,8 @@
#define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
+#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK)
+
#ifdef CONFIG_64BIT
/* VM is sealed, in vm_flags */
#define VM_SEALED _BITUL(63)
@@ -395,6 +408,17 @@ struct vm_operations_struct {
unsigned long addr);
};
+struct vm_unmapped_area_info {
+#define VM_UNMAPPED_AREA_TOPDOWN 1
+ unsigned long flags;
+ unsigned long length;
+ unsigned long low_limit;
+ unsigned long high_limit;
+ unsigned long align_mask;
+ unsigned long align_offset;
+ unsigned long start_gap;
+};
+
static inline void vma_iter_invalidate(struct vma_iterator *vmi)
{
mas_pause(&vmi->mas);
@@ -1055,4 +1079,39 @@ static inline int mmap_file(struct file *, struct vm_area_struct *)
return 0;
}
+static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & VM_GROWSDOWN)
+ return stack_guard_gap;
+
+ /* See reasoning around the VM_SHADOW_STACK definition */
+ if (vma->vm_flags & VM_SHADOW_STACK)
+ return PAGE_SIZE;
+
+ return 0;
+}
+
+static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
+{
+ unsigned long gap = stack_guard_start_gap(vma);
+ unsigned long vm_start = vma->vm_start;
+
+ vm_start -= gap;
+ if (vm_start > vma->vm_start)
+ vm_start = 0;
+ return vm_start;
+}
+
+static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
+{
+ unsigned long vm_end = vma->vm_end;
+
+ if (vma->vm_flags & VM_GROWSUP) {
+ vm_end += stack_guard_gap;
+ if (vm_end < vma->vm_end)
+ vm_end = -PAGE_SIZE;
+ }
+ return vm_end;
+}
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.47.1
* [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock
2024-12-03 18:05 [PATCH 0/5] mm/vma: make more mmap logic userland testable Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 2/5] mm/vma: move unmapped_area() " Lorenzo Stoakes
@ 2024-12-03 18:05 ` Lorenzo Stoakes
2024-12-05 0:18 ` Wei Yang
` (2 more replies)
2024-12-03 18:05 ` [PATCH 4/5] mm/vma: move stack expansion logic to mm/vma.c Lorenzo Stoakes
` (2 subsequent siblings)
5 siblings, 3 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-03 18:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
Right now, fs/exec.c invokes expand_downwards(), an otherwise internal
implementation detail of the VMA logic, to ensure that an arg page can be
obtained by get_user_pages_remote().
To move the stack expansion logic into mm/vma.c and make it available to
userland testing, we need to find an alternative approach here.
We do so by providing the mmap_read_lock_maybe_expand() function, which also
helpfully documents what get_arg_page() is doing here and adds an explicit
check against VM_GROWSDOWN so that the stack expansion logic is only invoked
when the VMA is indeed a downward-growing stack.
This allows expand_downwards() to become a static function.
Importantly, the VMA referenced by mmap_read_lock_maybe_expand() must NOT be
currently user-visible in any way, that is, it must not be placed within an
rmap or VMA tree. It must be a newly allocated VMA.
This is the case when exec invokes this function.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
fs/exec.c | 14 +++---------
include/linux/mm.h | 5 ++---
mm/mmap.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 58 insertions(+), 15 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 98cb7ba9983c..1e1f79c514de 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -205,18 +205,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
/*
* Avoid relying on expanding the stack down in GUP (which
* does not work for STACK_GROWSUP anyway), and just do it
- * by hand ahead of time.
+ * ahead of time.
*/
- if (write && pos < vma->vm_start) {
- mmap_write_lock(mm);
- ret = expand_downwards(vma, pos);
- if (unlikely(ret < 0)) {
- mmap_write_unlock(mm);
- return NULL;
- }
- mmap_write_downgrade(mm);
- } else
- mmap_read_lock(mm);
+ if (!mmap_read_lock_maybe_expand(mm, vma, pos, write))
+ return NULL;
/*
* We are doing an exec(). 'current' is the process
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4eb8e62d5c67..48312a934454 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3313,6 +3313,8 @@ extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admi
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
extern void exit_mmap(struct mm_struct *);
int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
+bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, bool write);
static inline int check_data_rlimit(unsigned long rlim,
unsigned long new,
@@ -3426,9 +3428,6 @@ extern unsigned long stack_guard_gap;
int expand_stack_locked(struct vm_area_struct *vma, unsigned long address);
struct vm_area_struct *expand_stack(struct mm_struct * mm, unsigned long addr);
-/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */
-int expand_downwards(struct vm_area_struct *vma, unsigned long address);
-
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
diff --git a/mm/mmap.c b/mm/mmap.c
index f053de1d6fae..4df38d3717ff 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1009,7 +1009,7 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
* vma is the first one with address < vma->vm_start. Have to extend vma.
* mmap_lock held for writing.
*/
-int expand_downwards(struct vm_area_struct *vma, unsigned long address)
+static int expand_downwards(struct vm_area_struct *vma, unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *prev;
@@ -1940,3 +1940,55 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
/* Shrink the vma to just the new range */
return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff);
}
+
+#ifdef CONFIG_MMU
+/*
+ * Obtain a read lock on mm->mmap_lock, if the specified address is below the
+ * start of the VMA, the intent is to perform a write, and it is a
+ * downward-growing stack, then attempt to expand the stack to contain it.
+ *
+ * This function is intended only for obtaining an argument page from an ELF
+ * image, and is almost certainly NOT what you want to use for any other
+ * purpose.
+ *
+ * IMPORTANT - VMA fields are accessed without an mmap lock being held, so the
+ * VMA referenced must not be linked in any user-visible tree, i.e. it must be a
+ * new VMA being mapped.
+ *
+ * The function assumes that addr is either contained within the VMA or below
+ * it, and makes no attempt to validate this value beyond that.
+ *
+ * Returns true if the read lock was obtained and a stack was perhaps expanded,
+ * false if the stack expansion failed.
+ *
+ * On stack expansion the function temporarily acquires an mmap write lock
+ * before downgrading it.
+ */
+bool mmap_read_lock_maybe_expand(struct mm_struct *mm,
+ struct vm_area_struct *new_vma,
+ unsigned long addr, bool write)
+{
+ if (!write || addr >= new_vma->vm_start) {
+ mmap_read_lock(mm);
+ return true;
+ }
+
+ if (!(new_vma->vm_flags & VM_GROWSDOWN))
+ return false;
+
+ mmap_write_lock(mm);
+ if (expand_downwards(new_vma, addr)) {
+ mmap_write_unlock(mm);
+ return false;
+ }
+
+ mmap_write_downgrade(mm);
+ return true;
+}
+#else
+bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, bool write)
+{
+ return false;
+}
+#endif
--
2.47.1
* [PATCH 4/5] mm/vma: move stack expansion logic to mm/vma.c
2024-12-03 18:05 [PATCH 0/5] mm/vma: make more mmap logic userland testable Lorenzo Stoakes
` (2 preceding siblings ...)
2024-12-03 18:05 ` [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock Lorenzo Stoakes
@ 2024-12-03 18:05 ` Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 5/5] mm/vma: move __vm_munmap() " Lorenzo Stoakes
2024-12-04 23:56 ` [PATCH 0/5] mm/vma: make more mmap logic userland testable Wei Yang
5 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-03 18:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
We build on the previous work that made expand_downwards() an entirely
internal function.
This logic is subtle, so it is highly useful to get it into mm/vma.c where we
can unit test it from userland.
We must additionally move acct_stack_growth() to vma.c as it is a helper
function used by both expand_downwards() and expand_upwards().
We are also then able to mark anon_vma_interval_tree_pre_update_vma() and
anon_vma_interval_tree_post_update_vma() static as these are no longer used
by anything else.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 205 -----------------------
mm/vma.c | 269 +++++++++++++++++++++++++++----
mm/vma.h | 12 +-
tools/testing/vma/vma.c | 5 +
tools/testing/vma/vma_internal.h | 62 +++++++
5 files changed, 310 insertions(+), 243 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 4df38d3717ff..55a8f2332b7c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -879,211 +879,6 @@ find_vma_prev(struct mm_struct *mm, unsigned long addr,
return vma;
}
-/*
- * Verify that the stack growth is acceptable and
- * update accounting. This is shared with both the
- * grow-up and grow-down cases.
- */
-static int acct_stack_growth(struct vm_area_struct *vma,
- unsigned long size, unsigned long grow)
-{
- struct mm_struct *mm = vma->vm_mm;
- unsigned long new_start;
-
- /* address space limit tests */
- if (!may_expand_vm(mm, vma->vm_flags, grow))
- return -ENOMEM;
-
- /* Stack limit test */
- if (size > rlimit(RLIMIT_STACK))
- return -ENOMEM;
-
- /* mlock limit tests */
- if (!mlock_future_ok(mm, vma->vm_flags, grow << PAGE_SHIFT))
- return -ENOMEM;
-
- /* Check to ensure the stack will not grow into a hugetlb-only region */
- new_start = (vma->vm_flags & VM_GROWSUP) ? vma->vm_start :
- vma->vm_end - size;
- if (is_hugepage_only_range(vma->vm_mm, new_start, size))
- return -EFAULT;
-
- /*
- * Overcommit.. This must be the final test, as it will
- * update security statistics.
- */
- if (security_vm_enough_memory_mm(mm, grow))
- return -ENOMEM;
-
- return 0;
-}
-
-#if defined(CONFIG_STACK_GROWSUP)
-/*
- * PA-RISC uses this for its stack.
- * vma is the last one with address > vma->vm_end. Have to extend vma.
- */
-static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
-{
- struct mm_struct *mm = vma->vm_mm;
- struct vm_area_struct *next;
- unsigned long gap_addr;
- int error = 0;
- VMA_ITERATOR(vmi, mm, vma->vm_start);
-
- if (!(vma->vm_flags & VM_GROWSUP))
- return -EFAULT;
-
- mmap_assert_write_locked(mm);
-
- /* Guard against exceeding limits of the address space. */
- address &= PAGE_MASK;
- if (address >= (TASK_SIZE & PAGE_MASK))
- return -ENOMEM;
- address += PAGE_SIZE;
-
- /* Enforce stack_guard_gap */
- gap_addr = address + stack_guard_gap;
-
- /* Guard against overflow */
- if (gap_addr < address || gap_addr > TASK_SIZE)
- gap_addr = TASK_SIZE;
-
- next = find_vma_intersection(mm, vma->vm_end, gap_addr);
- if (next && vma_is_accessible(next)) {
- if (!(next->vm_flags & VM_GROWSUP))
- return -ENOMEM;
- /* Check that both stack segments have the same anon_vma? */
- }
-
- if (next)
- vma_iter_prev_range_limit(&vmi, address);
-
- vma_iter_config(&vmi, vma->vm_start, address);
- if (vma_iter_prealloc(&vmi, vma))
- return -ENOMEM;
-
- /* We must make sure the anon_vma is allocated. */
- if (unlikely(anon_vma_prepare(vma))) {
- vma_iter_free(&vmi);
- return -ENOMEM;
- }
-
- /* Lock the VMA before expanding to prevent concurrent page faults */
- vma_start_write(vma);
- /* We update the anon VMA tree. */
- anon_vma_lock_write(vma->anon_vma);
-
- /* Somebody else might have raced and expanded it already */
- if (address > vma->vm_end) {
- unsigned long size, grow;
-
- size = address - vma->vm_start;
- grow = (address - vma->vm_end) >> PAGE_SHIFT;
-
- error = -ENOMEM;
- if (vma->vm_pgoff + (size >> PAGE_SHIFT) >= vma->vm_pgoff) {
- error = acct_stack_growth(vma, size, grow);
- if (!error) {
- if (vma->vm_flags & VM_LOCKED)
- mm->locked_vm += grow;
- vm_stat_account(mm, vma->vm_flags, grow);
- anon_vma_interval_tree_pre_update_vma(vma);
- vma->vm_end = address;
- /* Overwrite old entry in mtree. */
- vma_iter_store(&vmi, vma);
- anon_vma_interval_tree_post_update_vma(vma);
-
- perf_event_mmap(vma);
- }
- }
- }
- anon_vma_unlock_write(vma->anon_vma);
- vma_iter_free(&vmi);
- validate_mm(mm);
- return error;
-}
-#endif /* CONFIG_STACK_GROWSUP */
-
-/*
- * vma is the first one with address < vma->vm_start. Have to extend vma.
- * mmap_lock held for writing.
- */
-static int expand_downwards(struct vm_area_struct *vma, unsigned long address)
-{
- struct mm_struct *mm = vma->vm_mm;
- struct vm_area_struct *prev;
- int error = 0;
- VMA_ITERATOR(vmi, mm, vma->vm_start);
-
- if (!(vma->vm_flags & VM_GROWSDOWN))
- return -EFAULT;
-
- mmap_assert_write_locked(mm);
-
- address &= PAGE_MASK;
- if (address < mmap_min_addr || address < FIRST_USER_ADDRESS)
- return -EPERM;
-
- /* Enforce stack_guard_gap */
- prev = vma_prev(&vmi);
- /* Check that both stack segments have the same anon_vma? */
- if (prev) {
- if (!(prev->vm_flags & VM_GROWSDOWN) &&
- vma_is_accessible(prev) &&
- (address - prev->vm_end < stack_guard_gap))
- return -ENOMEM;
- }
-
- if (prev)
- vma_iter_next_range_limit(&vmi, vma->vm_start);
-
- vma_iter_config(&vmi, address, vma->vm_end);
- if (vma_iter_prealloc(&vmi, vma))
- return -ENOMEM;
-
- /* We must make sure the anon_vma is allocated. */
- if (unlikely(anon_vma_prepare(vma))) {
- vma_iter_free(&vmi);
- return -ENOMEM;
- }
-
- /* Lock the VMA before expanding to prevent concurrent page faults */
- vma_start_write(vma);
- /* We update the anon VMA tree. */
- anon_vma_lock_write(vma->anon_vma);
-
- /* Somebody else might have raced and expanded it already */
- if (address < vma->vm_start) {
- unsigned long size, grow;
-
- size = vma->vm_end - address;
- grow = (vma->vm_start - address) >> PAGE_SHIFT;
-
- error = -ENOMEM;
- if (grow <= vma->vm_pgoff) {
- error = acct_stack_growth(vma, size, grow);
- if (!error) {
- if (vma->vm_flags & VM_LOCKED)
- mm->locked_vm += grow;
- vm_stat_account(mm, vma->vm_flags, grow);
- anon_vma_interval_tree_pre_update_vma(vma);
- vma->vm_start = address;
- vma->vm_pgoff -= grow;
- /* Overwrite old entry in mtree. */
- vma_iter_store(&vmi, vma);
- anon_vma_interval_tree_post_update_vma(vma);
-
- perf_event_mmap(vma);
- }
- }
- }
- anon_vma_unlock_write(vma->anon_vma);
- vma_iter_free(&vmi);
- validate_mm(mm);
- return error;
-}
-
/* enforced gap between the expanding stack and other mappings. */
unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT;
diff --git a/mm/vma.c b/mm/vma.c
index 50c0c9c443d2..83c79bb42675 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -202,6 +202,38 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
flush_dcache_mmap_unlock(mapping);
}
+/*
+ * vma has some anon_vma assigned, and is already inserted on that
+ * anon_vma's interval trees.
+ *
+ * Before updating the vma's vm_start / vm_end / vm_pgoff fields, the
+ * vma must be removed from the anon_vma's interval trees using
+ * anon_vma_interval_tree_pre_update_vma().
+ *
+ * After the update, the vma will be reinserted using
+ * anon_vma_interval_tree_post_update_vma().
+ *
+ * The entire update must be protected by exclusive mmap_lock and by
+ * the root anon_vma's mutex.
+ */
+static void
+anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma)
+{
+ struct anon_vma_chain *avc;
+
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ anon_vma_interval_tree_remove(avc, &avc->anon_vma->rb_root);
+}
+
+static void
+anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
+{
+ struct anon_vma_chain *avc;
+
+ list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
+ anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
+}
+
/*
* vma_prepare() - Helper function for handling locking VMAs prior to altering
* @vp: The initialized vma_prepare struct
@@ -510,38 +542,6 @@ static int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
return __split_vma(vmi, vma, addr, new_below);
}
-/*
- * vma has some anon_vma assigned, and is already inserted on that
- * anon_vma's interval trees.
- *
- * Before updating the vma's vm_start / vm_end / vm_pgoff fields, the
- * vma must be removed from the anon_vma's interval trees using
- * anon_vma_interval_tree_pre_update_vma().
- *
- * After the update, the vma will be reinserted using
- * anon_vma_interval_tree_post_update_vma().
- *
- * The entire update must be protected by exclusive mmap_lock and by
- * the root anon_vma's mutex.
- */
-void
-anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma)
-{
- struct anon_vma_chain *avc;
-
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
- anon_vma_interval_tree_remove(avc, &avc->anon_vma->rb_root);
-}
-
-void
-anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
-{
- struct anon_vma_chain *avc;
-
- list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
- anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
-}
-
/*
* dup_anon_vma() - Helper function to duplicate anon_vma
* @dst: The destination VMA
@@ -2669,3 +2669,208 @@ unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
return gap;
}
+
+/*
+ * Verify that the stack growth is acceptable and
+ * update accounting. This is shared with both the
+ * grow-up and grow-down cases.
+ */
+static int acct_stack_growth(struct vm_area_struct *vma,
+ unsigned long size, unsigned long grow)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long new_start;
+
+ /* address space limit tests */
+ if (!may_expand_vm(mm, vma->vm_flags, grow))
+ return -ENOMEM;
+
+ /* Stack limit test */
+ if (size > rlimit(RLIMIT_STACK))
+ return -ENOMEM;
+
+ /* mlock limit tests */
+ if (!mlock_future_ok(mm, vma->vm_flags, grow << PAGE_SHIFT))
+ return -ENOMEM;
+
+ /* Check to ensure the stack will not grow into a hugetlb-only region */
+ new_start = (vma->vm_flags & VM_GROWSUP) ? vma->vm_start :
+ vma->vm_end - size;
+ if (is_hugepage_only_range(vma->vm_mm, new_start, size))
+ return -EFAULT;
+
+ /*
+ * Overcommit.. This must be the final test, as it will
+ * update security statistics.
+ */
+ if (security_vm_enough_memory_mm(mm, grow))
+ return -ENOMEM;
+
+ return 0;
+}
+
+#if defined(CONFIG_STACK_GROWSUP)
+/*
+ * PA-RISC uses this for its stack.
+ * vma is the last one with address > vma->vm_end. Have to extend vma.
+ */
+int expand_upwards(struct vm_area_struct *vma, unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *next;
+ unsigned long gap_addr;
+ int error = 0;
+ VMA_ITERATOR(vmi, mm, vma->vm_start);
+
+ if (!(vma->vm_flags & VM_GROWSUP))
+ return -EFAULT;
+
+ mmap_assert_write_locked(mm);
+
+ /* Guard against exceeding limits of the address space. */
+ address &= PAGE_MASK;
+ if (address >= (TASK_SIZE & PAGE_MASK))
+ return -ENOMEM;
+ address += PAGE_SIZE;
+
+ /* Enforce stack_guard_gap */
+ gap_addr = address + stack_guard_gap;
+
+ /* Guard against overflow */
+ if (gap_addr < address || gap_addr > TASK_SIZE)
+ gap_addr = TASK_SIZE;
+
+ next = find_vma_intersection(mm, vma->vm_end, gap_addr);
+ if (next && vma_is_accessible(next)) {
+ if (!(next->vm_flags & VM_GROWSUP))
+ return -ENOMEM;
+ /* Check that both stack segments have the same anon_vma? */
+ }
+
+ if (next)
+ vma_iter_prev_range_limit(&vmi, address);
+
+ vma_iter_config(&vmi, vma->vm_start, address);
+ if (vma_iter_prealloc(&vmi, vma))
+ return -ENOMEM;
+
+ /* We must make sure the anon_vma is allocated. */
+ if (unlikely(anon_vma_prepare(vma))) {
+ vma_iter_free(&vmi);
+ return -ENOMEM;
+ }
+
+ /* Lock the VMA before expanding to prevent concurrent page faults */
+ vma_start_write(vma);
+ /* We update the anon VMA tree. */
+ anon_vma_lock_write(vma->anon_vma);
+
+ /* Somebody else might have raced and expanded it already */
+ if (address > vma->vm_end) {
+ unsigned long size, grow;
+
+ size = address - vma->vm_start;
+ grow = (address - vma->vm_end) >> PAGE_SHIFT;
+
+ error = -ENOMEM;
+ if (vma->vm_pgoff + (size >> PAGE_SHIFT) >= vma->vm_pgoff) {
+ error = acct_stack_growth(vma, size, grow);
+ if (!error) {
+ if (vma->vm_flags & VM_LOCKED)
+ mm->locked_vm += grow;
+ vm_stat_account(mm, vma->vm_flags, grow);
+ anon_vma_interval_tree_pre_update_vma(vma);
+ vma->vm_end = address;
+ /* Overwrite old entry in mtree. */
+ vma_iter_store(&vmi, vma);
+ anon_vma_interval_tree_post_update_vma(vma);
+
+ perf_event_mmap(vma);
+ }
+ }
+ }
+ anon_vma_unlock_write(vma->anon_vma);
+ vma_iter_free(&vmi);
+ validate_mm(mm);
+ return error;
+}
+#endif /* CONFIG_STACK_GROWSUP */
+
+/*
+ * vma is the first one with address < vma->vm_start. Have to extend vma.
+ * mmap_lock held for writing.
+ */
+int expand_downwards(struct vm_area_struct *vma, unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct vm_area_struct *prev;
+ int error = 0;
+ VMA_ITERATOR(vmi, mm, vma->vm_start);
+
+ if (!(vma->vm_flags & VM_GROWSDOWN))
+ return -EFAULT;
+
+ mmap_assert_write_locked(mm);
+
+ address &= PAGE_MASK;
+ if (address < mmap_min_addr || address < FIRST_USER_ADDRESS)
+ return -EPERM;
+
+ /* Enforce stack_guard_gap */
+ prev = vma_prev(&vmi);
+ /* Check that both stack segments have the same anon_vma? */
+ if (prev) {
+ if (!(prev->vm_flags & VM_GROWSDOWN) &&
+ vma_is_accessible(prev) &&
+ (address - prev->vm_end < stack_guard_gap))
+ return -ENOMEM;
+ }
+
+ if (prev)
+ vma_iter_next_range_limit(&vmi, vma->vm_start);
+
+ vma_iter_config(&vmi, address, vma->vm_end);
+ if (vma_iter_prealloc(&vmi, vma))
+ return -ENOMEM;
+
+ /* We must make sure the anon_vma is allocated. */
+ if (unlikely(anon_vma_prepare(vma))) {
+ vma_iter_free(&vmi);
+ return -ENOMEM;
+ }
+
+ /* Lock the VMA before expanding to prevent concurrent page faults */
+ vma_start_write(vma);
+ /* We update the anon VMA tree. */
+ anon_vma_lock_write(vma->anon_vma);
+
+ /* Somebody else might have raced and expanded it already */
+ if (address < vma->vm_start) {
+ unsigned long size, grow;
+
+ size = vma->vm_end - address;
+ grow = (vma->vm_start - address) >> PAGE_SHIFT;
+
+ error = -ENOMEM;
+ if (grow <= vma->vm_pgoff) {
+ error = acct_stack_growth(vma, size, grow);
+ if (!error) {
+ if (vma->vm_flags & VM_LOCKED)
+ mm->locked_vm += grow;
+ vm_stat_account(mm, vma->vm_flags, grow);
+ anon_vma_interval_tree_pre_update_vma(vma);
+ vma->vm_start = address;
+ vma->vm_pgoff -= grow;
+ /* Overwrite old entry in mtree. */
+ vma_iter_store(&vmi, vma);
+ anon_vma_interval_tree_post_update_vma(vma);
+
+ perf_event_mmap(vma);
+ }
+ }
+ }
+ anon_vma_unlock_write(vma->anon_vma);
+ vma_iter_free(&vmi);
+ validate_mm(mm);
+ return error;
+}
diff --git a/mm/vma.h b/mm/vma.h
index c60f37d89eb1..6c460a120f82 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -139,12 +139,6 @@ void validate_mm(struct mm_struct *mm);
#define validate_mm(mm) do { } while (0)
#endif
-/* Required for expand_downwards(). */
-void anon_vma_interval_tree_pre_update_vma(struct vm_area_struct *vma);
-
-/* Required for expand_downwards(). */
-void anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma);
-
int vma_expand(struct vma_merge_struct *vmg);
int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
unsigned long start, unsigned long end, pgoff_t pgoff);
@@ -478,4 +472,10 @@ static inline bool can_modify_vma_madv(struct vm_area_struct *vma, int behavior)
#endif
+#if defined(CONFIG_STACK_GROWSUP)
+int expand_upwards(struct vm_area_struct *vma, unsigned long address);
+#endif
+
+int expand_downwards(struct vm_area_struct *vma, unsigned long address);
+
#endif /* __MM_VMA_H */
diff --git a/tools/testing/vma/vma.c b/tools/testing/vma/vma.c
index 39ee61e55634..891d87a9ad6b 100644
--- a/tools/testing/vma/vma.c
+++ b/tools/testing/vma/vma.c
@@ -53,6 +53,11 @@ struct task_struct *get_current(void)
return &__current;
}
+unsigned long rlimit(unsigned int limit)
+{
+ return (unsigned long)-1;
+}
+
/* Helper function to simply allocate a VMA. */
static struct vm_area_struct *alloc_vma(struct mm_struct *mm,
unsigned long start,
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 6ad8bd8edaad..fab3f3bdf2f0 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -79,6 +79,11 @@ extern unsigned long dac_mmap_min_addr;
#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK)
+#define RLIMIT_STACK 3 /* max stack size */
+#define RLIMIT_MEMLOCK 8 /* max locked-in-memory address space */
+
+#define CAP_IPC_LOCK 14
+
#ifdef CONFIG_64BIT
/* VM is sealed, in vm_flags */
#define VM_SEALED _BITUL(63)
@@ -478,6 +483,8 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached)
extern const struct vm_operations_struct vma_dummy_vm_ops;
+extern unsigned long rlimit(unsigned int limit);
+
static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
{
memset(vma, 0, sizeof(*vma));
@@ -1114,4 +1121,59 @@ static inline unsigned long vm_end_gap(struct vm_area_struct *vma)
return vm_end;
}
+static inline int is_hugepage_only_range(struct mm_struct *mm,
+ unsigned long addr, unsigned long len)
+{
+ return 0;
+}
+
+static inline bool vma_is_accessible(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & VM_ACCESS_FLAGS;
+}
+
+static inline bool capable(int cap)
+{
+ return true;
+}
+
+static inline bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
+ unsigned long bytes)
+{
+ unsigned long locked_pages, limit_pages;
+
+ if (!(flags & VM_LOCKED) || capable(CAP_IPC_LOCK))
+ return true;
+
+ locked_pages = bytes >> PAGE_SHIFT;
+ locked_pages += mm->locked_vm;
+
+ limit_pages = rlimit(RLIMIT_MEMLOCK);
+ limit_pages >>= PAGE_SHIFT;
+
+ return locked_pages <= limit_pages;
+}
+
+static inline int __anon_vma_prepare(struct vm_area_struct *vma)
+{
+ struct anon_vma *anon_vma = calloc(1, sizeof(struct anon_vma));
+
+ if (!anon_vma)
+ return -ENOMEM;
+
+ anon_vma->root = anon_vma;
+ vma->anon_vma = anon_vma;
+
+ return 0;
+}
+
+static inline int anon_vma_prepare(struct vm_area_struct *vma)
+{
+ if (likely(vma->anon_vma))
+ return 0;
+
+ return __anon_vma_prepare(vma);
+}
+
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.47.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 5/5] mm/vma: move __vm_munmap() to mm/vma.c
2024-12-03 18:05 [PATCH 0/5] mm/vma: make more mmap logic userland testable Lorenzo Stoakes
` (3 preceding siblings ...)
2024-12-03 18:05 ` [PATCH 4/5] mm/vma: move stack expansion logic to mm/vma.c Lorenzo Stoakes
@ 2024-12-03 18:05 ` Lorenzo Stoakes
2024-12-04 23:56 ` [PATCH 0/5] mm/vma: make more mmap logic userland testable Wei Yang
5 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-03 18:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
This was arbitrarily left in mmap.c; it makes no sense being there, so move
it to vma.c to render it testable.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 18 ------------------
mm/vma.c | 18 ++++++++++++++++++
mm/vma.h | 2 ++
tools/testing/vma/vma_internal.h | 9 +++++++++
4 files changed, 29 insertions(+), 18 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 55a8f2332b7c..1c6bdffa13dd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1044,24 +1044,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return ret;
}
-static int __vm_munmap(unsigned long start, size_t len, bool unlock)
-{
- int ret;
- struct mm_struct *mm = current->mm;
- LIST_HEAD(uf);
- VMA_ITERATOR(vmi, mm, start);
-
- if (mmap_write_lock_killable(mm))
- return -EINTR;
-
- ret = do_vmi_munmap(&vmi, mm, start, len, &uf, unlock);
- if (ret || !unlock)
- mmap_write_unlock(mm);
-
- userfaultfd_unmap_complete(mm, &uf);
- return ret;
-}
-
int vm_munmap(unsigned long start, size_t len)
{
return __vm_munmap(start, len, false);
diff --git a/mm/vma.c b/mm/vma.c
index 83c79bb42675..a06747845cac 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2874,3 +2874,21 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
validate_mm(mm);
return error;
}
+
+int __vm_munmap(unsigned long start, size_t len, bool unlock)
+{
+ int ret;
+ struct mm_struct *mm = current->mm;
+ LIST_HEAD(uf);
+ VMA_ITERATOR(vmi, mm, start);
+
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+
+ ret = do_vmi_munmap(&vmi, mm, start, len, &uf, unlock);
+ if (ret || !unlock)
+ mmap_write_unlock(mm);
+
+ userfaultfd_unmap_complete(mm, &uf);
+ return ret;
+}
diff --git a/mm/vma.h b/mm/vma.h
index 6c460a120f82..295d44ea54db 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -478,4 +478,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address);
int expand_downwards(struct vm_area_struct *vma, unsigned long address);
+int __vm_munmap(unsigned long start, size_t len, bool unlock);
+
#endif /* __MM_VMA_H */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index fab3f3bdf2f0..a7de59a0d694 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -906,6 +906,11 @@ static inline void mmap_write_unlock(struct mm_struct *)
{
}
+static inline int mmap_write_lock_killable(struct mm_struct *)
+{
+ return 0;
+}
+
static inline bool can_modify_mm(struct mm_struct *mm,
unsigned long start,
unsigned long end)
@@ -1175,5 +1180,9 @@ static inline int anon_vma_prepare(struct vm_area_struct *vma)
return __anon_vma_prepare(vma);
}
+static inline void userfaultfd_unmap_complete(struct mm_struct *mm,
+ struct list_head *uf)
+{
+}
#endif /* __MM_VMA_INTERNAL_H */
--
2.47.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c
2024-12-03 18:05 ` [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c Lorenzo Stoakes
@ 2024-12-04 11:55 ` kernel test robot
2024-12-04 12:10 ` Lorenzo Stoakes
2024-12-04 12:08 ` Lorenzo Stoakes
2024-12-04 13:10 ` kernel test robot
2 siblings, 1 reply; 21+ messages in thread
From: kernel test robot @ 2024-12-04 11:55 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: oe-kbuild-all, Linux Memory Management List, Liam R . Howlett,
Vlastimil Babka, Jann Horn, Eric Biederman, Kees Cook,
Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
linux-kernel
Hi Lorenzo,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Lorenzo-Stoakes/mm-vma-move-brk-internals-to-mm-vma-c/20241204-115150
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/3d24b9e67bb0261539ca921d1188a10a1b4d4357.1733248985.git.lorenzo.stoakes%40oracle.com
patch subject: [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c
config: mips-allnoconfig (https://download.01.org/0day-ci/archive/20241204/202412041907.3DXYQrz6-lkp@intel.com/config)
compiler: mips-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241204/202412041907.3DXYQrz6-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412041907.3DXYQrz6-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/mips/include/asm/cacheflush.h:13,
from include/linux/cacheflush.h:5,
from include/linux/highmem.h:8,
from include/linux/bvec.h:10,
from include/linux/blk_types.h:10,
from include/linux/writeback.h:13,
from include/linux/backing-dev.h:16,
from mm/vma_internal.h:12,
from mm/vma.c:7:
mm/vma.c: In function 'do_brk_flags':
>> include/linux/mm.h:450:44: error: 'READ_IMPLIES_EXEC' undeclared (first use in this function)
450 | #define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
| ^~~~~~~~~~~~~~~~~
include/linux/mm.h:453:55: note: in expansion of macro 'TASK_EXEC'
453 | #define VM_DATA_FLAGS_TSK_EXEC (VM_READ | VM_WRITE | TASK_EXEC | \
| ^~~~~~~~~
arch/mips/include/asm/page.h:215:33: note: in expansion of macro 'VM_DATA_FLAGS_TSK_EXEC'
215 | #define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
| ^~~~~~~~~~~~~~~~~~~~~~
mm/vma.c:2503:18: note: in expansion of macro 'VM_DATA_DEFAULT_FLAGS'
2503 | flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
| ^~~~~~~~~~~~~~~~~~~~~
include/linux/mm.h:450:44: note: each undeclared identifier is reported only once for each function it appears in
450 | #define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
| ^~~~~~~~~~~~~~~~~
include/linux/mm.h:453:55: note: in expansion of macro 'TASK_EXEC'
453 | #define VM_DATA_FLAGS_TSK_EXEC (VM_READ | VM_WRITE | TASK_EXEC | \
| ^~~~~~~~~
arch/mips/include/asm/page.h:215:33: note: in expansion of macro 'VM_DATA_FLAGS_TSK_EXEC'
215 | #define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
| ^~~~~~~~~~~~~~~~~~~~~~
mm/vma.c:2503:18: note: in expansion of macro 'VM_DATA_DEFAULT_FLAGS'
2503 | flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
| ^~~~~~~~~~~~~~~~~~~~~
vim +/READ_IMPLIES_EXEC +450 include/linux/mm.h
a8bef8ff6ea15fa Mel Gorman 2010-05-24 449
c62da0c35d58518 Anshuman Khandual 2020-04-10 @450 #define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
c62da0c35d58518 Anshuman Khandual 2020-04-10 451
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c
2024-12-03 18:05 ` [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c Lorenzo Stoakes
2024-12-04 11:55 ` kernel test robot
@ 2024-12-04 12:08 ` Lorenzo Stoakes
2024-12-04 13:10 ` kernel test robot
2 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-04 12:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Tue, Dec 03, 2024 at 06:05:08PM +0000, Lorenzo Stoakes wrote:
> Now we have moved mmap_region() internals to mm/vma.c, making it available
> to userland testing, it makes sense to do the same with brk().
>
> This continues the pattern of VMA heavy lifting being done in mm/vma.c in
> an environment where it can be subject to straightforward unit and
> regression testing, with other VMA-adjacent files becoming wrappers around
> this functionality.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Hi Andrew,
Could you apply the below fix-patch? It seems we have another one of those
thorny header dependency things going on here.
Thanks,
Lorenzo
----8<----
From 0c61e8fe9f9ffe9fd7ac8e1fde6e8cad8bac2b30 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Wed, 4 Dec 2024 12:00:35 +0000
Subject: [PATCH] mm/vma: add missing personality header import
Some architectures have different header dependency chains, and we
incorrectly failed to include linux/personality.h, which broke MIPS. Fix this.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma_internal.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/vma_internal.h b/mm/vma_internal.h
index fc5f172a36bd..2f05735ff190 100644
--- a/mm/vma_internal.h
+++ b/mm/vma_internal.h
@@ -35,6 +35,7 @@
#include <linux/mutex.h>
#include <linux/pagemap.h>
#include <linux/perf_event.h>
+#include <linux/personality.h>
#include <linux/pfn.h>
#include <linux/rcupdate.h>
#include <linux/rmap.h>
--
2.47.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c
2024-12-04 11:55 ` kernel test robot
@ 2024-12-04 12:10 ` Lorenzo Stoakes
0 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-04 12:10 UTC (permalink / raw)
To: kernel test robot
Cc: Andrew Morton, oe-kbuild-all, Linux Memory Management List,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara,
linux-fsdevel, linux-kernel
On Wed, Dec 04, 2024 at 07:55:16PM +0800, kernel test robot wrote:
> Hi Lorenzo,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Lorenzo-Stoakes/mm-vma-move-brk-internals-to-mm-vma-c/20241204-115150
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/3d24b9e67bb0261539ca921d1188a10a1b4d4357.1733248985.git.lorenzo.stoakes%40oracle.com
> patch subject: [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c
> config: mips-allnoconfig (https://download.01.org/0day-ci/archive/20241204/202412041907.3DXYQrz6-lkp@intel.com/config)
> compiler: mips-linux-gcc (GCC) 14.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241204/202412041907.3DXYQrz6-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202412041907.3DXYQrz6-lkp@intel.com/
Thanks for the report!
Seems we are missing an include of include/uapi/linux/personality.h, which
declares READ_IMPLIES_EXEC; however, the standard means of picking this up
appears to be including linux/personality.h, which in turn imports that header.
I have sent a fix-patch separately.
>
> All errors (new ones prefixed by >>):
>
> In file included from arch/mips/include/asm/cacheflush.h:13,
> from include/linux/cacheflush.h:5,
> from include/linux/highmem.h:8,
> from include/linux/bvec.h:10,
> from include/linux/blk_types.h:10,
> from include/linux/writeback.h:13,
> from include/linux/backing-dev.h:16,
> from mm/vma_internal.h:12,
> from mm/vma.c:7:
> mm/vma.c: In function 'do_brk_flags':
> >> include/linux/mm.h:450:44: error: 'READ_IMPLIES_EXEC' undeclared (first use in this function)
> 450 | #define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
> | ^~~~~~~~~~~~~~~~~
> include/linux/mm.h:453:55: note: in expansion of macro 'TASK_EXEC'
> 453 | #define VM_DATA_FLAGS_TSK_EXEC (VM_READ | VM_WRITE | TASK_EXEC | \
> | ^~~~~~~~~
> arch/mips/include/asm/page.h:215:33: note: in expansion of macro 'VM_DATA_FLAGS_TSK_EXEC'
> 215 | #define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
> | ^~~~~~~~~~~~~~~~~~~~~~
> mm/vma.c:2503:18: note: in expansion of macro 'VM_DATA_DEFAULT_FLAGS'
> 2503 | flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
> | ^~~~~~~~~~~~~~~~~~~~~
> include/linux/mm.h:450:44: note: each undeclared identifier is reported only once for each function it appears in
> 450 | #define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
> | ^~~~~~~~~~~~~~~~~
> include/linux/mm.h:453:55: note: in expansion of macro 'TASK_EXEC'
> 453 | #define VM_DATA_FLAGS_TSK_EXEC (VM_READ | VM_WRITE | TASK_EXEC | \
> | ^~~~~~~~~
> arch/mips/include/asm/page.h:215:33: note: in expansion of macro 'VM_DATA_FLAGS_TSK_EXEC'
> 215 | #define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
> | ^~~~~~~~~~~~~~~~~~~~~~
> mm/vma.c:2503:18: note: in expansion of macro 'VM_DATA_DEFAULT_FLAGS'
> 2503 | flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
> | ^~~~~~~~~~~~~~~~~~~~~
>
>
> vim +/READ_IMPLIES_EXEC +450 include/linux/mm.h
>
> a8bef8ff6ea15fa Mel Gorman 2010-05-24 449
> c62da0c35d58518 Anshuman Khandual 2020-04-10 @450 #define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
> c62da0c35d58518 Anshuman Khandual 2020-04-10 451
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c
2024-12-03 18:05 ` [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c Lorenzo Stoakes
2024-12-04 11:55 ` kernel test robot
2024-12-04 12:08 ` Lorenzo Stoakes
@ 2024-12-04 13:10 ` kernel test robot
2 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2024-12-04 13:10 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: llvm, oe-kbuild-all, Linux Memory Management List,
Liam R . Howlett, Vlastimil Babka, Jann Horn, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara,
linux-fsdevel, linux-kernel
Hi Lorenzo,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Lorenzo-Stoakes/mm-vma-move-brk-internals-to-mm-vma-c/20241204-115150
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/3d24b9e67bb0261539ca921d1188a10a1b4d4357.1733248985.git.lorenzo.stoakes%40oracle.com
patch subject: [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c
config: mips-ath25_defconfig (https://download.01.org/0day-ci/archive/20241204/202412042012.zymuBpfD-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 592c0fe55f6d9a811028b5f3507be91458ab2713)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241204/202412042012.zymuBpfD-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412042012.zymuBpfD-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from mm/vma.c:7:
In file included from mm/vma_internal.h:12:
In file included from include/linux/backing-dev.h:16:
In file included from include/linux/writeback.h:13:
In file included from include/linux/blk_types.h:10:
In file included from include/linux/bvec.h:10:
In file included from include/linux/highmem.h:8:
In file included from include/linux/cacheflush.h:5:
In file included from arch/mips/include/asm/cacheflush.h:13:
In file included from include/linux/mm.h:2223:
include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
In file included from mm/vma.c:7:
In file included from mm/vma_internal.h:29:
include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
47 | __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
| ~~~~~~~~~~~ ^ ~~~
include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
49 | NR_ZONE_LRU_BASE + lru, nr_pages);
| ~~~~~~~~~~~~~~~~ ^ ~~~
>> mm/vma.c:2503:11: error: use of undeclared identifier 'READ_IMPLIES_EXEC'
2503 | flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
| ^
arch/mips/include/asm/page.h:215:31: note: expanded from macro 'VM_DATA_DEFAULT_FLAGS'
215 | #define VM_DATA_DEFAULT_FLAGS VM_DATA_FLAGS_TSK_EXEC
| ^
include/linux/mm.h:453:54: note: expanded from macro 'VM_DATA_FLAGS_TSK_EXEC'
453 | #define VM_DATA_FLAGS_TSK_EXEC (VM_READ | VM_WRITE | TASK_EXEC | \
| ^
include/linux/mm.h:450:44: note: expanded from macro 'TASK_EXEC'
450 | #define TASK_EXEC ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0)
| ^
3 warnings and 1 error generated.
vim +/READ_IMPLIES_EXEC +2503 mm/vma.c
2481
2482 /*
2483 * do_brk_flags() - Increase the brk vma if the flags match.
2484 * @vmi: The vma iterator
2485 * @addr: The start address
2486 * @len: The length of the increase
2487 * @vma: The vma,
2488 * @flags: The VMA Flags
2489 *
2490 * Extend the brk VMA from addr to addr + len. If the VMA is NULL or the flags
2491 * do not match then create a new anonymous VMA. Eventually we may be able to
2492 * do some brk-specific accounting here.
2493 */
2494 int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
2495 unsigned long addr, unsigned long len, unsigned long flags)
2496 {
2497 struct mm_struct *mm = current->mm;
2498
2499 /*
2500 * Check against address space limits by the changed size
2501 * Note: This happens *after* clearing old mappings in some code paths.
2502 */
> 2503 flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 0/5] mm/vma: make more mmap logic userland testable
2024-12-03 18:05 [PATCH 0/5] mm/vma: make more mmap logic userland testable Lorenzo Stoakes
` (4 preceding siblings ...)
2024-12-03 18:05 ` [PATCH 5/5] mm/vma: move __vm_munmap() " Lorenzo Stoakes
@ 2024-12-04 23:56 ` Wei Yang
2024-12-05 7:03 ` Lorenzo Stoakes
5 siblings, 1 reply; 21+ messages in thread
From: Wei Yang @ 2024-12-04 23:56 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Eric Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-mm, linux-fsdevel, linux-kernel
On Tue, Dec 03, 2024 at 06:05:07PM +0000, Lorenzo Stoakes wrote:
>This series carries on the work the work started in previous series and
^^^ ^^^
Duplicated?
>continued in commit 52956b0d7fb9 ("mm: isolate mmap internal logic to
>mm/vma.c"), moving the remainder of memory mapping implementation details
>logic into mm/vma.c allowing the bulk of the mapping logic to be unit
>tested.
>
>It is highly useful to do so, as this means we can both fundamentally test
>this core logic, and introduce regression tests to ensure any issues
>previously resolved do not recur.
>
>Vitally, this includes the do_brk_flags() function, meaning we have both
>core means of userland mapping memory now testable.
>
>Performance testing was performed after this change given the brk() system
>call's sensitivity to change, and no performance regression was observed.
May I ask what performance test is done?
--
Wei Yang
Help you, Help me
* Re: [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock
2024-12-03 18:05 ` [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock Lorenzo Stoakes
@ 2024-12-05 0:18 ` Wei Yang
2024-12-05 7:01 ` Lorenzo Stoakes
2024-12-05 7:06 ` Lorenzo Stoakes
2024-12-10 17:14 ` Jann Horn
2024-12-14 1:05 ` Kees Cook
2 siblings, 2 replies; 21+ messages in thread
From: Wei Yang @ 2024-12-05 0:18 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Eric Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-mm, linux-fsdevel, linux-kernel
On Tue, Dec 03, 2024 at 06:05:10PM +0000, Lorenzo Stoakes wrote:
>Right now fs/exec.c invokes expand_downwards(), an otherwise internal
>implementation detail of the VMA logic in order to ensure that an arg page
>can be obtained by get_user_pages_remote().
>
>In order to be able to move the stack expansion logic into mm/vma.c in
>order to make it available to userland testing we need to find an
Looks the second "in order" is not necessary.
Not a native speaker, just my personal feeling.
>alternative approach here.
>
>We do so by providing the mmap_read_lock_maybe_expand() function which also
>helpfully documents what get_arg_page() is doing here and adds an
>additional check against VM_GROWSDOWN to make explicit that the stack
>expansion logic is only invoked when the VMA is indeed a downward-growing
>stack.
>
>This allows expand_downwards() to become a static function.
>
>Importantly, the VMA referenced by mmap_read_maybe_expand() must NOT be
>currently user-visible in any way, that is, placed within an rmap or VMA
>tree. It must be a newly allocated VMA.
>
>This is the case when exec invokes this function.
>
>Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>---
> fs/exec.c | 14 +++---------
> include/linux/mm.h | 5 ++---
> mm/mmap.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 58 insertions(+), 15 deletions(-)
>
>diff --git a/fs/exec.c b/fs/exec.c
>index 98cb7ba9983c..1e1f79c514de 100644
>--- a/fs/exec.c
>+++ b/fs/exec.c
>@@ -205,18 +205,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> /*
> * Avoid relying on expanding the stack down in GUP (which
> * does not work for STACK_GROWSUP anyway), and just do it
>- * by hand ahead of time.
>+ * ahead of time.
> */
>- if (write && pos < vma->vm_start) {
>- mmap_write_lock(mm);
>- ret = expand_downwards(vma, pos);
>- if (unlikely(ret < 0)) {
>- mmap_write_unlock(mm);
>- return NULL;
>- }
>- mmap_write_downgrade(mm);
>- } else
>- mmap_read_lock(mm);
>+ if (!mmap_read_lock_maybe_expand(mm, vma, pos, write))
>+ return NULL;
>
> /*
> * We are doing an exec(). 'current' is the process
>diff --git a/include/linux/mm.h b/include/linux/mm.h
>index 4eb8e62d5c67..48312a934454 100644
>--- a/include/linux/mm.h
>+++ b/include/linux/mm.h
>@@ -3313,6 +3313,8 @@ extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admi
> extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> extern void exit_mmap(struct mm_struct *);
> int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
>+bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>+ unsigned long addr, bool write);
>
> static inline int check_data_rlimit(unsigned long rlim,
> unsigned long new,
>@@ -3426,9 +3428,6 @@ extern unsigned long stack_guard_gap;
> int expand_stack_locked(struct vm_area_struct *vma, unsigned long address);
> struct vm_area_struct *expand_stack(struct mm_struct * mm, unsigned long addr);
>
>-/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */
>-int expand_downwards(struct vm_area_struct *vma, unsigned long address);
>-
> /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
> extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
> extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
>diff --git a/mm/mmap.c b/mm/mmap.c
>index f053de1d6fae..4df38d3717ff 100644
>--- a/mm/mmap.c
>+++ b/mm/mmap.c
>@@ -1009,7 +1009,7 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> * vma is the first one with address < vma->vm_start. Have to extend vma.
> * mmap_lock held for writing.
> */
>-int expand_downwards(struct vm_area_struct *vma, unsigned long address)
>+static int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct vm_area_struct *prev;
>@@ -1940,3 +1940,55 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> /* Shrink the vma to just the new range */
> return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff);
> }
>+
>+#ifdef CONFIG_MMU
>+/*
>+ * Obtain a read lock on mm->mmap_lock, if the specified address is below the
>+ * start of the VMA, the intent is to perform a write, and it is a
>+ * downward-growing stack, then attempt to expand the stack to contain it.
>+ *
>+ * This function is intended only for obtaining an argument page from an ELF
>+ * image, and is almost certainly NOT what you want to use for any other
>+ * purpose.
>+ *
>+ * IMPORTANT - VMA fields are accessed without an mmap lock being held, so the
>+ * VMA referenced must not be linked in any user-visible tree, i.e. it must be a
>+ * new VMA being mapped.
>+ *
>+ * The function assumes that addr is either contained within the VMA or below
>+ * it, and makes no attempt to validate this value beyond that.
>+ *
>+ * Returns true if the read lock was obtained and a stack was perhaps expanded,
>+ * false if the stack expansion failed.
>+ *
>+ * On stack expansion the function temporarily acquires an mmap write lock
>+ * before downgrading it.
>+ */
>+bool mmap_read_lock_maybe_expand(struct mm_struct *mm,
>+ struct vm_area_struct *new_vma,
>+ unsigned long addr, bool write)
>+{
>+ if (!write || addr >= new_vma->vm_start) {
>+ mmap_read_lock(mm);
>+ return true;
>+ }
>+
>+ if (!(new_vma->vm_flags & VM_GROWSDOWN))
>+ return false;
>+
In expand_downwards() we have this checked.
Maybe we just leave this done in one place is enough?
>+ mmap_write_lock(mm);
>+ if (expand_downwards(new_vma, addr)) {
>+ mmap_write_unlock(mm);
>+ return false;
>+ }
>+
>+ mmap_write_downgrade(mm);
>+ return true;
>+}
>+#else
>+bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
>+ unsigned long addr, bool write)
>+{
>+ return false;
>+}
>+#endif
>--
>2.47.1
>
--
Wei Yang
Help you, Help me
* Re: [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock
2024-12-05 0:18 ` Wei Yang
@ 2024-12-05 7:01 ` Lorenzo Stoakes
2024-12-08 11:27 ` Wei Yang
2024-12-05 7:06 ` Lorenzo Stoakes
1 sibling, 1 reply; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-05 7:01 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Eric Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-mm, linux-fsdevel, linux-kernel
On Thu, Dec 05, 2024 at 12:18:19AM +0000, Wei Yang wrote:
> On Tue, Dec 03, 2024 at 06:05:10PM +0000, Lorenzo Stoakes wrote:
> >Right now fs/exec.c invokes expand_downwards(), an otherwise internal
> >implementation detail of the VMA logic in order to ensure that an arg page
> >can be obtained by get_user_pages_remote().
> >
> >In order to be able to move the stack expansion logic into mm/vma.c in
> >order to make it available to userland testing we need to find an
>
> Looks the second "in order" is not necessary.
>
> Not a native speaker, just my personal feeling.
>
> >alternative approach here.
> >
> >We do so by providing the mmap_read_lock_maybe_expand() function which also
> >helpfully documents what get_arg_page() is doing here and adds an
> >additional check against VM_GROWSDOWN to make explicit that the stack
> >expansion logic is only invoked when the VMA is indeed a downward-growing
> >stack.
> >
> >This allows expand_downwards() to become a static function.
> >
> >Importantly, the VMA referenced by mmap_read_maybe_expand() must NOT be
> >currently user-visible in any way, that is, placed within an rmap or VMA
> >tree. It must be a newly allocated VMA.
> >
> >This is the case when exec invokes this function.
> >
> >Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >---
> > fs/exec.c | 14 +++---------
> > include/linux/mm.h | 5 ++---
> > mm/mmap.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-
> > 3 files changed, 58 insertions(+), 15 deletions(-)
> >
> >diff --git a/fs/exec.c b/fs/exec.c
> >index 98cb7ba9983c..1e1f79c514de 100644
> >--- a/fs/exec.c
> >+++ b/fs/exec.c
> >@@ -205,18 +205,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> > /*
> > * Avoid relying on expanding the stack down in GUP (which
> > * does not work for STACK_GROWSUP anyway), and just do it
> >- * by hand ahead of time.
> >+ * ahead of time.
> > */
> >- if (write && pos < vma->vm_start) {
> >- mmap_write_lock(mm);
> >- ret = expand_downwards(vma, pos);
> >- if (unlikely(ret < 0)) {
> >- mmap_write_unlock(mm);
> >- return NULL;
> >- }
> >- mmap_write_downgrade(mm);
> >- } else
> >- mmap_read_lock(mm);
> >+ if (!mmap_read_lock_maybe_expand(mm, vma, pos, write))
> >+ return NULL;
> >
> > /*
> > * We are doing an exec(). 'current' is the process
> >diff --git a/include/linux/mm.h b/include/linux/mm.h
> >index 4eb8e62d5c67..48312a934454 100644
> >--- a/include/linux/mm.h
> >+++ b/include/linux/mm.h
> >@@ -3313,6 +3313,8 @@ extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admi
> > extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> > extern void exit_mmap(struct mm_struct *);
> > int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
> >+bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
> >+ unsigned long addr, bool write);
> >
> > static inline int check_data_rlimit(unsigned long rlim,
> > unsigned long new,
> >@@ -3426,9 +3428,6 @@ extern unsigned long stack_guard_gap;
> > int expand_stack_locked(struct vm_area_struct *vma, unsigned long address);
> > struct vm_area_struct *expand_stack(struct mm_struct * mm, unsigned long addr);
> >
> >-/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */
> >-int expand_downwards(struct vm_area_struct *vma, unsigned long address);
> >-
> > /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
> > extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
> > extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
> >diff --git a/mm/mmap.c b/mm/mmap.c
> >index f053de1d6fae..4df38d3717ff 100644
> >--- a/mm/mmap.c
> >+++ b/mm/mmap.c
> >@@ -1009,7 +1009,7 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > * vma is the first one with address < vma->vm_start. Have to extend vma.
> > * mmap_lock held for writing.
> > */
> >-int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >+static int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *prev;
> >@@ -1940,3 +1940,55 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> > /* Shrink the vma to just the new range */
> > return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff);
> > }
> >+
> >+#ifdef CONFIG_MMU
> >+/*
> >+ * Obtain a read lock on mm->mmap_lock, if the specified address is below the
> >+ * start of the VMA, the intent is to perform a write, and it is a
> >+ * downward-growing stack, then attempt to expand the stack to contain it.
> >+ *
> >+ * This function is intended only for obtaining an argument page from an ELF
> >+ * image, and is almost certainly NOT what you want to use for any other
> >+ * purpose.
> >+ *
> >+ * IMPORTANT - VMA fields are accessed without an mmap lock being held, so the
> >+ * VMA referenced must not be linked in any user-visible tree, i.e. it must be a
> >+ * new VMA being mapped.
> >+ *
> >+ * The function assumes that addr is either contained within the VMA or below
> >+ * it, and makes no attempt to validate this value beyond that.
> >+ *
> >+ * Returns true if the read lock was obtained and a stack was perhaps expanded,
> >+ * false if the stack expansion failed.
> >+ *
> >+ * On stack expansion the function temporarily acquires an mmap write lock
> >+ * before downgrading it.
> >+ */
> >+bool mmap_read_lock_maybe_expand(struct mm_struct *mm,
> >+ struct vm_area_struct *new_vma,
> >+ unsigned long addr, bool write)
> >+{
> >+ if (!write || addr >= new_vma->vm_start) {
> >+ mmap_read_lock(mm);
> >+ return true;
> >+ }
> >+
> >+ if (!(new_vma->vm_flags & VM_GROWSDOWN))
> >+ return false;
> >+
>
> In expand_downwards() we have this checked.
>
> Maybe we just leave this done in one place is enough?
Wei, I feel like I have repeated myself about 'mathematically smallest
code' rather too many times at this stage. Doing an unsolicited drive-by
review applying this concept, which I have roundly and clearly rejected, is
not appreciated.
At any rate, we are checking this _before the mmap lock is acquired_. It is
also self-documenting.
Please try to take on board the point that there are many factors when it
comes to writing kernel code, aversion to possibly generated branches being
only one of them.
>
> >+ mmap_write_lock(mm);
> >+ if (expand_downwards(new_vma, addr)) {
> >+ mmap_write_unlock(mm);
> >+ return false;
> >+ }
> >+
> >+ mmap_write_downgrade(mm);
> >+ return true;
> >+}
> >+#else
> >+bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
> >+ unsigned long addr, bool write)
> >+{
> >+ return false;
> >+}
> >+#endif
> >--
> >2.47.1
> >
>
> --
> Wei Yang
> Help you, Help me
* Re: [PATCH 0/5] mm/vma: make more mmap logic userland testable
2024-12-04 23:56 ` [PATCH 0/5] mm/vma: make more mmap logic userland testable Wei Yang
@ 2024-12-05 7:03 ` Lorenzo Stoakes
2024-12-06 0:30 ` Wei Yang
0 siblings, 1 reply; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-05 7:03 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Eric Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-mm, linux-fsdevel, linux-kernel
On Wed, Dec 04, 2024 at 11:56:32PM +0000, Wei Yang wrote:
> On Tue, Dec 03, 2024 at 06:05:07PM +0000, Lorenzo Stoakes wrote:
> >This series carries on the work the work started in previous series and
> ^^^ ^^^
>
> Duplicated?
Thanks yes, but trivial enough that I'm not sure it's worth a
correction. Will fix if need to respin.
>
> >continued in commit 52956b0d7fb9 ("mm: isolate mmap internal logic to
> >mm/vma.c"), moving the remainder of memory mapping implementation details
> >logic into mm/vma.c allowing the bulk of the mapping logic to be unit
> >tested.
> >
> >It is highly useful to do so, as this means we can both fundamentally test
> >this core logic, and introduce regression tests to ensure any issues
> >previously resolved do not recur.
> >
> >Vitally, this includes the do_brk_flags() function, meaning we have both
> >core means of userland mapping memory now testable.
> >
> >Performance testing was performed after this change given the brk() system
> >call's sensitivity to change, and no performance regression was observed.
>
> May I ask what performance test is done?
mmtests brk1, brk2 (will-it-scale)
You'd not really expect an impact based on relocation of this code, but
with brk it's always worth checking...
>
>
> --
> Wei Yang
> Help you, Help me
* Re: [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock
2024-12-05 0:18 ` Wei Yang
2024-12-05 7:01 ` Lorenzo Stoakes
@ 2024-12-05 7:06 ` Lorenzo Stoakes
1 sibling, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-05 7:06 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Eric Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-mm, linux-fsdevel, linux-kernel
On Thu, Dec 05, 2024 at 12:18:19AM +0000, Wei Yang wrote:
> On Tue, Dec 03, 2024 at 06:05:10PM +0000, Lorenzo Stoakes wrote:
> >Right now fs/exec.c invokes expand_downwards(), an otherwise internal
> >implementation detail of the VMA logic in order to ensure that an arg page
> >can be obtained by get_user_pages_remote().
> >
> >In order to be able to move the stack expansion logic into mm/vma.c in
> >order to make it available to userland testing we need to find an
>
> Looks the second "in order" is not necessary.
>
> Not a native speaker, just my personal feeling.
>
> >alternative approach here.
Sorry missed this one.
You're right this is clunky (non-native speakers often find this better
than native speakers to whom clunky turn of phrase can be more easily
overlooked I imagine).
Second 'in order to' should be 'to' really, but I'm not sure this is
important enough to take pains to address, will fix if a respin is
otherwise needed.
> >
> >We do so by providing the mmap_read_lock_maybe_expand() function which also
> >helpfully documents what get_arg_page() is doing here and adds an
> >additional check against VM_GROWSDOWN to make explicit that the stack
> >expansion logic is only invoked when the VMA is indeed a downward-growing
> >stack.
> >
> >This allows expand_downwards() to become a static function.
> >
> >Importantly, the VMA referenced by mmap_read_maybe_expand() must NOT be
> >currently user-visible in any way, that is, placed within an rmap or VMA
> >tree. It must be a newly allocated VMA.
> >
> >This is the case when exec invokes this function.
> >
> >Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >---
> > fs/exec.c | 14 +++---------
> > include/linux/mm.h | 5 ++---
> > mm/mmap.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-
> > 3 files changed, 58 insertions(+), 15 deletions(-)
> >
> >diff --git a/fs/exec.c b/fs/exec.c
> >index 98cb7ba9983c..1e1f79c514de 100644
> >--- a/fs/exec.c
> >+++ b/fs/exec.c
> >@@ -205,18 +205,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> > /*
> > * Avoid relying on expanding the stack down in GUP (which
> > * does not work for STACK_GROWSUP anyway), and just do it
> >- * by hand ahead of time.
> >+ * ahead of time.
> > */
> >- if (write && pos < vma->vm_start) {
> >- mmap_write_lock(mm);
> >- ret = expand_downwards(vma, pos);
> >- if (unlikely(ret < 0)) {
> >- mmap_write_unlock(mm);
> >- return NULL;
> >- }
> >- mmap_write_downgrade(mm);
> >- } else
> >- mmap_read_lock(mm);
> >+ if (!mmap_read_lock_maybe_expand(mm, vma, pos, write))
> >+ return NULL;
> >
> > /*
> > * We are doing an exec(). 'current' is the process
> >diff --git a/include/linux/mm.h b/include/linux/mm.h
> >index 4eb8e62d5c67..48312a934454 100644
> >--- a/include/linux/mm.h
> >+++ b/include/linux/mm.h
> >@@ -3313,6 +3313,8 @@ extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admi
> > extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> > extern void exit_mmap(struct mm_struct *);
> > int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift);
> >+bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
> >+ unsigned long addr, bool write);
> >
> > static inline int check_data_rlimit(unsigned long rlim,
> > unsigned long new,
> >@@ -3426,9 +3428,6 @@ extern unsigned long stack_guard_gap;
> > int expand_stack_locked(struct vm_area_struct *vma, unsigned long address);
> > struct vm_area_struct *expand_stack(struct mm_struct * mm, unsigned long addr);
> >
> >-/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */
> >-int expand_downwards(struct vm_area_struct *vma, unsigned long address);
> >-
> > /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
> > extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
> > extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
> >diff --git a/mm/mmap.c b/mm/mmap.c
> >index f053de1d6fae..4df38d3717ff 100644
> >--- a/mm/mmap.c
> >+++ b/mm/mmap.c
> >@@ -1009,7 +1009,7 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > * vma is the first one with address < vma->vm_start. Have to extend vma.
> > * mmap_lock held for writing.
> > */
> >-int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> >+static int expand_downwards(struct vm_area_struct *vma, unsigned long address)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct vm_area_struct *prev;
> >@@ -1940,3 +1940,55 @@ int relocate_vma_down(struct vm_area_struct *vma, unsigned long shift)
> > /* Shrink the vma to just the new range */
> > return vma_shrink(&vmi, vma, new_start, new_end, vma->vm_pgoff);
> > }
> >+
> >+#ifdef CONFIG_MMU
> >+/*
> >+ * Obtain a read lock on mm->mmap_lock, if the specified address is below the
> >+ * start of the VMA, the intent is to perform a write, and it is a
> >+ * downward-growing stack, then attempt to expand the stack to contain it.
> >+ *
> >+ * This function is intended only for obtaining an argument page from an ELF
> >+ * image, and is almost certainly NOT what you want to use for any other
> >+ * purpose.
> >+ *
> >+ * IMPORTANT - VMA fields are accessed without an mmap lock being held, so the
> >+ * VMA referenced must not be linked in any user-visible tree, i.e. it must be a
> >+ * new VMA being mapped.
> >+ *
> >+ * The function assumes that addr is either contained within the VMA or below
> >+ * it, and makes no attempt to validate this value beyond that.
> >+ *
> >+ * Returns true if the read lock was obtained and a stack was perhaps expanded,
> >+ * false if the stack expansion failed.
> >+ *
> >+ * On stack expansion the function temporarily acquires an mmap write lock
> >+ * before downgrading it.
> >+ */
> >+bool mmap_read_lock_maybe_expand(struct mm_struct *mm,
> >+ struct vm_area_struct *new_vma,
> >+ unsigned long addr, bool write)
> >+{
> >+ if (!write || addr >= new_vma->vm_start) {
> >+ mmap_read_lock(mm);
> >+ return true;
> >+ }
> >+
> >+ if (!(new_vma->vm_flags & VM_GROWSDOWN))
> >+ return false;
> >+
>
> In expand_downwards() we have this checked.
>
> Maybe we just leave this done in one place is enough?
>
> >+ mmap_write_lock(mm);
> >+ if (expand_downwards(new_vma, addr)) {
> >+ mmap_write_unlock(mm);
> >+ return false;
> >+ }
> >+
> >+ mmap_write_downgrade(mm);
> >+ return true;
> >+}
> >+#else
> >+bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
> >+ unsigned long addr, bool write)
> >+{
> >+ return false;
> >+}
> >+#endif
> >--
> >2.47.1
> >
>
> --
> Wei Yang
> Help you, Help me
* Re: [PATCH 0/5] mm/vma: make more mmap logic userland testable
2024-12-05 7:03 ` Lorenzo Stoakes
@ 2024-12-06 0:30 ` Wei Yang
2024-12-09 10:35 ` Lorenzo Stoakes
0 siblings, 1 reply; 21+ messages in thread
From: Wei Yang @ 2024-12-06 0:30 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Wei Yang, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, Eric Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-mm, linux-fsdevel,
linux-kernel
On Thu, Dec 05, 2024 at 07:03:08AM +0000, Lorenzo Stoakes wrote:
>On Wed, Dec 04, 2024 at 11:56:32PM +0000, Wei Yang wrote:
>> On Tue, Dec 03, 2024 at 06:05:07PM +0000, Lorenzo Stoakes wrote:
>> >This series carries on the work the work started in previous series and
>> ^^^ ^^^
>>
>> Duplicated?
>
>Thanks yes, but trivial enough that I'm not sure it's worth a
>correction. Will fix if need to respin.
>
>>
>> >continued in commit 52956b0d7fb9 ("mm: isolate mmap internal logic to
>> >mm/vma.c"), moving the remainder of memory mapping implementation details
>> >logic into mm/vma.c allowing the bulk of the mapping logic to be unit
>> >tested.
>> >
>> >It is highly useful to do so, as this means we can both fundamentally test
>> >this core logic, and introduce regression tests to ensure any issues
>> >previously resolved do not recur.
>> >
>> >Vitally, this includes the do_brk_flags() function, meaning we have both
>> >core means of userland mapping memory now testable.
>> >
>> >Performance testing was performed after this change given the brk() system
>> >call's sensitivity to change, and no performance regression was observed.
>>
>> May I ask what performance test is done?
>
>mmtests brk1, brk2 (will-it-scale)
The one from here ?
https://github.com/gormanm/mmtests
>
>You'd not really expect an impact based on relocation of this code, but
>with brk it's always worth checking...
>
Yes, I am trying to learn which performance tests we would usually use.
>>
>>
>> --
>> Wei Yang
>> Help you, Help me
--
Wei Yang
Help you, Help me
* Re: [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock
2024-12-05 7:01 ` Lorenzo Stoakes
@ 2024-12-08 11:27 ` Wei Yang
2024-12-09 10:47 ` Lorenzo Stoakes
0 siblings, 1 reply; 21+ messages in thread
From: Wei Yang @ 2024-12-08 11:27 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Wei Yang, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, Eric Biederman, Kees Cook, Alexander Viro,
Christian Brauner, Jan Kara, linux-mm, linux-fsdevel,
linux-kernel
On Thu, Dec 05, 2024 at 07:01:14AM +0000, Lorenzo Stoakes wrote:
[...]
>>
>> Maybe we just leave this done in one place is enough?
>
>Wei, I feel like I have repeated myself about 'mathematically smallest
>code' rather too many times at this stage. Doing an unsolicited drive-by
>review applying this concept, which I have roundly and clearly rejected, is
>not appreciated.
>
Hi, Lorenzo
I would apologize for introducing this unpleasant mail. Would be more
thoughtful next time.
>At any rate, we are checking this _before the mmap lock is acquired_. It is
>also self-documenting.
>
>Please try to take on board the point that there are many factors when it
>comes to writing kernel code, aversion to possibly generated branches being
>only one of them.
>
Thanks for this suggestion.
I am trying to be as professional as you are. In case you have other
suggestions, they are welcome.
--
Wei Yang
Help you, Help me
* Re: [PATCH 0/5] mm/vma: make more mmap logic userland testable
2024-12-06 0:30 ` Wei Yang
@ 2024-12-09 10:35 ` Lorenzo Stoakes
0 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-09 10:35 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Eric Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-mm, linux-fsdevel, linux-kernel
On Fri, Dec 06, 2024 at 12:30:54AM +0000, Wei Yang wrote:
> On Thu, Dec 05, 2024 at 07:03:08AM +0000, Lorenzo Stoakes wrote:
> >On Wed, Dec 04, 2024 at 11:56:32PM +0000, Wei Yang wrote:
> >> On Tue, Dec 03, 2024 at 06:05:07PM +0000, Lorenzo Stoakes wrote:
> >> >This series carries on the work the work started in previous series and
> >> ^^^ ^^^
> >>
> >> Duplicated?
> >
> >Thanks yes, but trivial enough that I'm not sure it's worth a
> >correction. Will fix if need to respin.
> >
> >>
> >> >continued in commit 52956b0d7fb9 ("mm: isolate mmap internal logic to
> >> >mm/vma.c"), moving the remainder of memory mapping implementation details
> >> >logic into mm/vma.c allowing the bulk of the mapping logic to be unit
> >> >tested.
> >> >
> >> >It is highly useful to do so, as this means we can both fundamentally test
> >> >this core logic, and introduce regression tests to ensure any issues
> >> >previously resolved do not recur.
> >> >
> >> >Vitally, this includes the do_brk_flags() function, meaning we have both
> >> >core means of userland mapping memory now testable.
> >> >
> >> >Performance testing was performed after this change given the brk() system
> >> >call's sensitivity to change, and no performance regression was observed.
> >>
> >> May I ask what performance test is done?
> >
> >mmtests brk1, brk2 (will-it-scale)
>
> The one from here ?
>
> https://github.com/gormanm/mmtests
Yes
>
> >
> >You'd not really expect an impact based on relocation of this code, but
> >with brk it's always worth checking...
> >
>
> Yes, I am trying to learn what performance tests we would usually use.
Mel's tests also pull in from the will-it-scale project [0], from which
the brk tests I'm referring to originate. The mmtests logic just performs
some statistical analysis and comparisons etc. across a number of
different test sources.
[0]: https://github.com/antonblanchard/will-it-scale
>
> >>
> >>
> >> --
> >> Wei Yang
> >> Help you, Help me
>
> --
> Wei Yang
> Help you, Help me
* Re: [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock
2024-12-08 11:27 ` Wei Yang
@ 2024-12-09 10:47 ` Lorenzo Stoakes
0 siblings, 0 replies; 21+ messages in thread
From: Lorenzo Stoakes @ 2024-12-09 10:47 UTC (permalink / raw)
To: Wei Yang
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Eric Biederman, Kees Cook, Alexander Viro, Christian Brauner,
Jan Kara, linux-mm, linux-fsdevel, linux-kernel
On Sun, Dec 08, 2024 at 11:27:06AM +0000, Wei Yang wrote:
> On Thu, Dec 05, 2024 at 07:01:14AM +0000, Lorenzo Stoakes wrote:
> [...]
> >>
> >> Maybe we just leave this done in one place is enough?
> >
> >Wei, I feel like I have repeated myself about 'mathematically smallest
> >code' rather too many times at this stage. Doing an unsolicited drive-by
> >review applying this concept, which I have roundly and clearly rejected, is
> >not appreciated.
> >
>
> Hi, Lorenzo
>
> I apologize for introducing this unpleasant mail. I will be more
> thoughtful next time.
It wasn't unpleasant, don't worry! :) I'm just trying to be as clear as I
can about this so as to avoid you spending time on things that aren't
useful.
On occasion I think, for clarity, it's important to be firm. Otherwise,
if I were receptive even to things that I thought not worthwhile, you'd
end up wasting your time doing work that might not be used.
>
> >At any rate, we are checking this _before the mmap lock is acquired_. It is
> >also self-documenting.
> >
> >Please try to take on board the point that there are many factors when it
> >comes to writing kernel code, aversion to possibly generated branches being
> >only one of them.
> >
>
> Thanks for this suggestion.
>
> I am trying to be as professional as you are. In case you have other
> suggestions, they are welcome.
Thanks, what I would say is that simply observing that we might duplicate
some logic in a non-harmful way does not necessarily indicate that this
should be changed.
Obviously if you can evidence a _statistically significant_ performance
impact then OF COURSE you should report something like this and send a
patch for it (alongside the evidence of the perf regression).
But in general, if you feel the need to refactor just for the sake of
eliminating branches or grouping code like this, it isn't helpful or
useful.
Refactorings can be very useful _in general_ (I have done a lot of work on
things like this myself obviously), but it's important to assess the RoI -
is the churn worth the benefit, does it make a significant enough impact,
and is it 'tasteful'?
These things are at least somewhat subjective obviously.
What I would suggest you focus on instead is in finding bugs - your help in
finding the bug where I (ugh) set a pointer to an error value in a case
where, if you were unlucky, it might be dereferenced - was a really helpful
contribution, as you can tell from how quickly we approved it and arranged
for backporting.
I'd say this ought to be your focus. For instance [0] was a horrid mistake
on my part, and ripe for being discovered. Having a second pair of eyes
checking for this kind of thing would be very useful, and discovering this
kind of bug as early as possible would be hugely valued.
So yeah TL;DR my advice is at the moment - focus on finding bugs above all
else :)
Cheers, Lorenzo
[0]: https://lore.kernel.org/linux-mm/20241206215229.244413-1-lorenzo.stoakes@oracle.com/
>
>
> --
> Wei Yang
> Help you, Help me
* Re: [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock
2024-12-03 18:05 ` [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock Lorenzo Stoakes
2024-12-05 0:18 ` Wei Yang
@ 2024-12-10 17:14 ` Jann Horn
2024-12-14 1:05 ` Kees Cook
2 siblings, 0 replies; 21+ messages in thread
From: Jann Horn @ 2024-12-10 17:14 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Eric Biederman,
Kees Cook, Alexander Viro, Christian Brauner, Jan Kara, linux-mm,
linux-fsdevel, linux-kernel
On Tue, Dec 3, 2024 at 7:05 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> Right now fs/exec.c invokes expand_downwards(), an otherwise internal
> implementation detail of the VMA logic in order to ensure that an arg page
> can be obtained by get_user_pages_remote().
>
> In order to be able to move the stack expansion logic into mm/vma.c in
> order to make it available to userland testing we need to find an
> alternative approach here.
>
> We do so by providing the mmap_read_lock_maybe_expand() function which also
> helpfully documents what get_arg_page() is doing here and adds an
> additional check against VM_GROWSDOWN to make explicit that the stack
> expansion logic is only invoked when the VMA is indeed a downward-growing
> stack.
>
> This allows expand_downwards() to become a static function.
>
> Importantly, the VMA referenced by mmap_read_lock_maybe_expand() must NOT
> be currently user-visible in any way, that is, placed within an rmap or
> VMA tree. It must be a newly allocated VMA.
>
> This is the case when exec invokes this function.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> fs/exec.c | 14 +++---------
> include/linux/mm.h | 5 ++---
> mm/mmap.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 58 insertions(+), 15 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 98cb7ba9983c..1e1f79c514de 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -205,18 +205,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> /*
> * Avoid relying on expanding the stack down in GUP (which
> * does not work for STACK_GROWSUP anyway), and just do it
> - * by hand ahead of time.
> + * ahead of time.
> */
> - if (write && pos < vma->vm_start) {
> - mmap_write_lock(mm);
> - ret = expand_downwards(vma, pos);
> - if (unlikely(ret < 0)) {
> - mmap_write_unlock(mm);
> - return NULL;
> - }
> - mmap_write_downgrade(mm);
> - } else
> - mmap_read_lock(mm);
> + if (!mmap_read_lock_maybe_expand(mm, vma, pos, write))
> + return NULL;
[...]
> +/*
> + * Obtain a read lock on mm->mmap_lock, if the specified address is below the
> + * start of the VMA, the intent is to perform a write, and it is a
> + * downward-growing stack, then attempt to expand the stack to contain it.
> + *
> + * This function is intended only for obtaining an argument page from an ELF
> + * image, and is almost certainly NOT what you want to use for any other
> + * purpose.
> + *
> + * IMPORTANT - VMA fields are accessed without an mmap lock being held, so the
> + * VMA referenced must not be linked in any user-visible tree, i.e. it must be a
> + * new VMA being mapped.
> + *
> + * The function assumes that addr is either contained within the VMA or below
> + * it, and makes no attempt to validate this value beyond that.
> + *
> + * Returns true if the read lock was obtained and a stack was perhaps expanded,
> + * false if the stack expansion failed.
> + *
> + * On stack expansion the function temporarily acquires an mmap write lock
> + * before downgrading it.
> + */
> +bool mmap_read_lock_maybe_expand(struct mm_struct *mm,
> + struct vm_area_struct *new_vma,
> + unsigned long addr, bool write)
> +{
> + if (!write || addr >= new_vma->vm_start) {
> + mmap_read_lock(mm);
> + return true;
> + }
> +
> + if (!(new_vma->vm_flags & VM_GROWSDOWN))
> + return false;
> +
> + mmap_write_lock(mm);
> + if (expand_downwards(new_vma, addr)) {
> + mmap_write_unlock(mm);
> + return false;
> + }
> +
> + mmap_write_downgrade(mm);
> + return true;
> +}
Random thought: For write==1, this looks a bit like
lock_mm_and_find_vma(mm, addr, NULL), which needs similar stack
expansion logic for handling userspace faults. But it's for a
sufficiently different situation that maybe it makes sense to keep it
like you did it, as a separate function...
* Re: [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock
2024-12-03 18:05 ` [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock Lorenzo Stoakes
2024-12-05 0:18 ` Wei Yang
2024-12-10 17:14 ` Jann Horn
@ 2024-12-14 1:05 ` Kees Cook
2 siblings, 0 replies; 21+ messages in thread
From: Kees Cook @ 2024-12-14 1:05 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
Eric Biederman, Alexander Viro, Christian Brauner, Jan Kara,
linux-mm, linux-fsdevel, linux-kernel
On Tue, Dec 03, 2024 at 06:05:10PM +0000, Lorenzo Stoakes wrote:
> Right now fs/exec.c invokes expand_downwards(), an otherwise internal
> implementation detail of the VMA logic in order to ensure that an arg page
> can be obtained by get_user_pages_remote().
I think this is already in -next, but yeah, this looks good. It does mix
in some logic changes, but they're pretty minor.
> - if (write && pos < vma->vm_start) {
> - mmap_write_lock(mm);
> - ret = expand_downwards(vma, pos);
> - if (unlikely(ret < 0)) {
> - mmap_write_unlock(mm);
> - return NULL;
> - }
> - mmap_write_downgrade(mm);
> - } else
> - mmap_read_lock(mm);
> + if (!mmap_read_lock_maybe_expand(mm, vma, pos, write))
> + return NULL;
>
> [...]
> + if (!write || addr >= new_vma->vm_start) {
> + mmap_read_lock(mm);
> + return true;
> + }
> +
> + if (!(new_vma->vm_flags & VM_GROWSDOWN))
> + return false;
The VM_GROWSDOWN check is moved around a bit. It seems okay, though.
> +
> + mmap_write_lock(mm);
> + if (expand_downwards(new_vma, addr)) {
> + mmap_write_unlock(mm);
> + return false;
> + }
> +
> + mmap_write_downgrade(mm);
> + return true;
I actually find this much more readable now since it follows the more
common "bail out early" coding style.
Acked-by: Kees Cook <kees@kernel.org>
Thanks!
-Kees
--
Kees Cook
end of thread, other threads:[~2024-12-14 1:05 UTC | newest]
Thread overview: 21+ messages
2024-12-03 18:05 [PATCH 0/5] mm/vma: make more mmap logic userland testable Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 1/5] mm/vma: move brk() internals to mm/vma.c Lorenzo Stoakes
2024-12-04 11:55 ` kernel test robot
2024-12-04 12:10 ` Lorenzo Stoakes
2024-12-04 12:08 ` Lorenzo Stoakes
2024-12-04 13:10 ` kernel test robot
2024-12-03 18:05 ` [PATCH 2/5] mm/vma: move unmapped_area() " Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 3/5] mm: abstract get_arg_page() stack expansion and mmap read lock Lorenzo Stoakes
2024-12-05 0:18 ` Wei Yang
2024-12-05 7:01 ` Lorenzo Stoakes
2024-12-08 11:27 ` Wei Yang
2024-12-09 10:47 ` Lorenzo Stoakes
2024-12-05 7:06 ` Lorenzo Stoakes
2024-12-10 17:14 ` Jann Horn
2024-12-14 1:05 ` Kees Cook
2024-12-03 18:05 ` [PATCH 4/5] mm/vma: move stack expansion logic to mm/vma.c Lorenzo Stoakes
2024-12-03 18:05 ` [PATCH 5/5] mm/vma: move __vm_munmap() " Lorenzo Stoakes
2024-12-04 23:56 ` [PATCH 0/5] mm/vma: make more mmap logic userland testable Wei Yang
2024-12-05 7:03 ` Lorenzo Stoakes
2024-12-06 0:30 ` Wei Yang
2024-12-09 10:35 ` Lorenzo Stoakes