* [PATCH v2 0/8] fix error handling in mmap_region() and refactor
@ 2024-10-23 20:38 Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
` (7 more replies)
0 siblings, 8 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise, resulting in
incomplete state, memory leaks and other unpleasantness.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently observed
issues with resource leaks and observable inconsistent state.
This series goes to great lengths to simplify how mmap_region() works and
to avoid unwinding errors late on in the process of setting up the VMA for
the new mapping, and equally avoids such operations occurring while the VMA
is in an inconsistent state.
The first four patches are intended for backporting, as they correct the
possibility of users encountering corrupted state when invoking mmap(),
which is otherwise at risk of happening.
After this we go further, refactoring the code, placing it in mm/vma.c in
order to make it eventually userland testable, and significantly
simplifying the logic to avoid this issue arising in future.
v2:
* Marked first 4 patches as hotfixes, the rest as not.
* Improved comment in vma_close() as per Vlastimil.
* Updated hole byte count as per Jann.
* Updated comment in map_deny_write_exec() as per Jann.
* Dropped unnecessary vma_iter_free() as per Vlastimil, Liam.
* Corrected vms_abort_munmap_vmas() mistaken assumption about nr_pages as
  per Vlastimil.
* Changed order of initial checks in mmap_region() to avoid user-visible
  side effects as per Vlastimil, Liam.
* Corrected a silly incorrect use of a vma field.
* Various style corrections as per Liam.
* Fixed a horrid mistake with the VMA merge; reworked the logic to avoid
  that nonsense altogether.
* Add fields to map state rather than using vmg fields to avoid
confusion/risk of vmg state changing breaking things.
* Replaced last commit removing merge retry with one that retries the
merge, only sanely.
v1:
https://lore.kernel.org/all/cover.1729628198.git.lorenzo.stoakes@oracle.com/
Lorenzo Stoakes (8):
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
mm: unconditionally close VMAs on error
mm: refactor map_deny_write_exec()
mm: resolve faulty mmap_region() error path behaviour
tools: testing: add additional vma_internal.h stubs
mm: isolate mmap internal logic to mm/vma.c
mm: refactor __mmap_region()
mm: defer second attempt at merge on mmap()
include/linux/mman.h | 21 +-
mm/internal.h | 45 ++++
mm/mmap.c | 262 ++-----------------
mm/mprotect.c | 2 +-
mm/nommu.c | 7 +-
mm/vma.c | 435 ++++++++++++++++++++++++++++++-
mm/vma.h | 103 +-------
mm/vma_internal.h | 5 +
tools/testing/vma/vma_internal.h | 115 +++++++-
9 files changed, 634 insertions(+), 361 deletions(-)
--
2.47.0
* [PATCH hotfix 6.12 v2 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
@ 2024-10-23 20:38 ` Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
` (6 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
After an attempted mmap() fails, we are no longer in a situation where we
can safely interact with VMA hooks. This is currently not enforced, meaning
that we need complicated handling to ensure we do not incorrectly call
these hooks.
We can avoid the whole issue by treating the VMA as suspect the moment that
the file->f_ops->mmap() function reports an error by replacing whatever VMA
operations were installed with a dummy empty set of VMA operations.
We do so through a new helper function internal to mm - mmap_file() - which
is both more logically named than the existing call_mmap() function and
correctly isolates handling of the vm_op reassignment to mm.
All the existing invocations of call_mmap() outside of mm are ultimately
nested within the call_mmap() from mm, which we now replace.
It is therefore safe to leave call_mmap() in place as a convenience
function (and to avoid churn). The invokers are:
ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap()
coda_file_operations -> mmap -> coda_file_mmap()
shm_file_operations -> shm_mmap()
shm_file_operations_huge -> shm_mmap()
dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops
-> i915_gem_dmabuf_mmap()
None of these callers interact with vm_ops or mappings in a problematic way
on error, quickly exiting out.
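For context, call_mmap() itself remains a thin wrapper around the file's
mmap hook, roughly (per its definition in include/linux/fs.h):

	static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
	{
		return file->f_op->mmap(file, vma);
	}

mmap_file() differs only in additionally sanitising vma->vm_ops should that
hook fail, which is why it needs to be internal to mm.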
Reported-by: Jann Horn <jannh@google.com>
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Cc: stable <stable@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/internal.h | 27 +++++++++++++++++++++++++++
mm/mmap.c | 6 +++---
mm/nommu.c | 4 ++--
3 files changed, 32 insertions(+), 5 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 508f7802dd2b..af032e76dfd4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -108,6 +108,33 @@ static inline void *folio_raw_mapping(const struct folio *folio)
return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
}
+/*
+ * This is a file-backed mapping, and is about to be memory mapped - invoke its
+ * mmap hook and safely handle error conditions. On error, VMA hooks will be
+ * mutated.
+ *
+ * @file: File which backs the mapping.
+ * @vma: VMA which we are mapping.
+ *
+ * Returns: 0 if success, error otherwise.
+ */
+static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
+{
+ int err = call_mmap(file, vma);
+
+ if (likely(!err))
+ return 0;
+
+ /*
+ * OK, we tried to call the file hook for mmap(), but an error
+ * arose. The mapping is in an inconsistent state and we must not invoke
+ * any further hooks on it.
+ */
+ vma->vm_ops = &vma_dummy_vm_ops;
+
+ return err;
+}
+
#ifdef CONFIG_MMU
/* Flags for folio_pte_batch(). */
diff --git a/mm/mmap.c b/mm/mmap.c
index 1ba0878bbc30..10f4ccaf491b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1425,7 +1425,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/*
* clear PTEs while the vma is still in the tree so that rmap
* cannot race with the freeing later in the truncate scenario.
- * This is also needed for call_mmap(), which is why vm_ops
+ * This is also needed for mmap_file(), which is why vm_ops
* close function is called.
*/
vms_clean_up_area(&vms, &mas_detach);
@@ -1450,7 +1450,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (file) {
vma->vm_file = get_file(file);
- error = call_mmap(file, vma);
+ error = mmap_file(file, vma);
if (error)
goto unmap_and_free_vma;
@@ -1473,7 +1473,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_iter_config(&vmi, addr, end);
/*
- * If vm_flags changed after call_mmap(), we should try merge
+ * If vm_flags changed after mmap_file(), we should try merge
* vma again as we may succeed this time.
*/
if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
diff --git a/mm/nommu.c b/mm/nommu.c
index 385b0c15add8..f9ccc02458ec 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -885,7 +885,7 @@ static int do_mmap_shared_file(struct vm_area_struct *vma)
{
int ret;
- ret = call_mmap(vma->vm_file, vma);
+ ret = mmap_file(vma->vm_file, vma);
if (ret == 0) {
vma->vm_region->vm_top = vma->vm_region->vm_end;
return 0;
@@ -918,7 +918,7 @@ static int do_mmap_private(struct vm_area_struct *vma,
* happy.
*/
if (capabilities & NOMMU_MAP_DIRECT) {
- ret = call_mmap(vma->vm_file, vma);
+ ret = mmap_file(vma->vm_file, vma);
/* shouldn't return success if we're not sharing */
if (WARN_ON_ONCE(!is_nommu_shared_mapping(vma->vm_flags)))
ret = -ENOSYS;
--
2.47.0
* [PATCH hotfix 6.12 v2 2/8] mm: unconditionally close VMAs on error
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
@ 2024-10-23 20:38 ` Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
` (5 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
Incorrect invocation of VMA callbacks when the VMA is no longer in a
consistent state is bug-prone and risky to perform.
With regards to the important vm_ops->close() callback, we have gone to
great lengths to try to track whether or not we ought to close VMAs.
Rather than doing so and risking making a mistake somewhere, instead
unconditionally close and reset vma->vm_ops to an empty dummy operations
set with a NULL .close operator.
We introduce a new function to do so - vma_close() - and simplify existing
vms logic which tracked whether we needed to close or not.
This simplifies the logic, avoids incorrect double-calling of the .close()
callback and allows us to update error paths to simply call vma_close()
unconditionally - making VMA closure idempotent.
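To illustrate the idempotence (a hypothetical snippet, not taken from this
patch): the first call installs vma_dummy_vm_ops, which has no .close hook,
so any subsequent call is harmless:

	vma_close(vma);	/* Invokes ->close() if present, then resets vm_ops. */
	vma_close(vma);	/* Dummy ops have no ->close(), so this is a no-op. */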
Reported-by: Jann Horn <jannh@google.com>
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Cc: stable <stable@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/internal.h | 18 ++++++++++++++++++
mm/mmap.c | 5 ++---
mm/nommu.c | 3 +--
mm/vma.c | 14 +++++---------
mm/vma.h | 4 +---
5 files changed, 27 insertions(+), 17 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index af032e76dfd4..0c4bf09bf788 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -135,6 +135,24 @@ static inline int mmap_file(struct file *file, struct vm_area_struct *vma)
return err;
}
+/*
+ * If the VMA has a close hook then close it, and since closing it might leave
+ * it in an inconsistent state which makes the use of any hooks suspect, clear
+ * them down by installing dummy empty hooks.
+ */
+static inline void vma_close(struct vm_area_struct *vma)
+{
+ if (vma->vm_ops && vma->vm_ops->close) {
+ vma->vm_ops->close(vma);
+
+ /*
+ * The mapping is in an inconsistent state, and no further hooks
+ * may be invoked upon it.
+ */
+ vma->vm_ops = &vma_dummy_vm_ops;
+ }
+}
+
#ifdef CONFIG_MMU
/* Flags for folio_pte_batch(). */
diff --git a/mm/mmap.c b/mm/mmap.c
index 10f4ccaf491b..d55c58e99a54 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1576,8 +1576,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return addr;
close_and_free_vma:
- if (file && !vms.closed_vm_ops && vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (file || vma->vm_file) {
unmap_and_free_vma:
@@ -1937,7 +1936,7 @@ void exit_mmap(struct mm_struct *mm)
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
- remove_vma(vma, /* unreachable = */ true, /* closed = */ false);
+ remove_vma(vma, /* unreachable = */ true);
count++;
cond_resched();
vma = vma_next(&vmi);
diff --git a/mm/nommu.c b/mm/nommu.c
index f9ccc02458ec..635d028d647b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -589,8 +589,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
*/
static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
{
- if (vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (vma->vm_file)
fput(vma->vm_file);
put_nommu_region(vma->vm_region);
diff --git a/mm/vma.c b/mm/vma.c
index 3c5a80876725..bb7cfa2dc282 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -323,11 +323,10 @@ static bool can_vma_merge_right(struct vma_merge_struct *vmg,
/*
* Close a vm structure and free it.
*/
-void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed)
+void remove_vma(struct vm_area_struct *vma, bool unreachable)
{
might_sleep();
- if (!closed && vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
+ vma_close(vma);
if (vma->vm_file)
fput(vma->vm_file);
mpol_put(vma_policy(vma));
@@ -1115,9 +1114,7 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
vms_clear_ptes(vms, mas_detach, true);
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- if (vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
- vms->closed_vm_ops = true;
+ vma_close(vma);
}
/*
@@ -1160,7 +1157,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
/* Remove and clean up vmas */
mas_set(mas_detach, 0);
mas_for_each(mas_detach, vma, ULONG_MAX)
- remove_vma(vma, /* = */ false, vms->closed_vm_ops);
+ remove_vma(vma, /* unreachable = */ false);
vm_unacct_memory(vms->nr_accounted);
validate_mm(mm);
@@ -1684,8 +1681,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
return new_vma;
out_vma_link:
- if (new_vma->vm_ops && new_vma->vm_ops->close)
- new_vma->vm_ops->close(new_vma);
+ vma_close(new_vma);
if (new_vma->vm_file)
fput(new_vma->vm_file);
diff --git a/mm/vma.h b/mm/vma.h
index 55457cb68200..75558b5e9c8c 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -42,7 +42,6 @@ struct vma_munmap_struct {
int vma_count; /* Number of vmas that will be removed */
bool unlock; /* Unlock after the munmap */
bool clear_ptes; /* If there are outstanding PTE to be cleared */
- bool closed_vm_ops; /* call_mmap() was encountered, so vmas may be closed */
/* 1 byte hole */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
@@ -198,7 +197,6 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
vms->unmap_start = FIRST_USER_ADDRESS;
vms->unmap_end = USER_PGTABLES_CEILING;
vms->clear_ptes = false;
- vms->closed_vm_ops = false;
}
#endif
@@ -269,7 +267,7 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
-void remove_vma(struct vm_area_struct *vma, bool unreachable, bool closed);
+void remove_vma(struct vm_area_struct *vma, bool unreachable);
void unmap_region(struct ma_state *mas, struct vm_area_struct *vma,
struct vm_area_struct *prev, struct vm_area_struct *next);
--
2.47.0
* [PATCH hotfix 6.12 v2 3/8] mm: refactor map_deny_write_exec()
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
@ 2024-10-23 20:38 ` Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
` (4 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
Refactor map_deny_write_exec() so that it no longer unnecessarily requires
a VMA parameter, but rather accepts VMA flags parameters, which allows us to
use this function early in mmap_region() in a subsequent commit.
While we're here, we refactor the function to be more readable and add some
additional documentation.
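As a rough sketch of how the flags-based interface is used (the precise
call sites are in the diff below) - in mmap_region() the VMA has only just
been set up so the old and new flags are identical, whereas mprotect()
compares the VMA's current flags against the proposed ones:

	/* In mmap_region(): the VMA has just been created, so old == new. */
	if (map_deny_write_exec(vma->vm_flags, vma->vm_flags))
		error = -EACCES;

	/* In do_mprotect_pkey(): current flags vs. the proposed new flags. */
	if (map_deny_write_exec(vma->vm_flags, newflags))
		error = -EACCES;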
Reported-by: Jann Horn <jannh@google.com>
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Cc: stable <stable@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
include/linux/mman.h | 21 ++++++++++++++++++---
mm/mmap.c | 2 +-
mm/mprotect.c | 2 +-
mm/vma.h | 2 +-
4 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/include/linux/mman.h b/include/linux/mman.h
index bcb201ab7a41..8ddca62d6460 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -188,16 +188,31 @@ static inline bool arch_memory_deny_write_exec_supported(void)
*
* d) mmap(PROT_READ | PROT_EXEC)
* mmap(PROT_READ | PROT_EXEC | PROT_BTI)
+ *
+ * This is only applicable if the user has set the Memory-Deny-Write-Execute
+ * (MDWE) protection mask for the current process.
+ *
+ * @old specifies the VMA flags the VMA originally possessed, and @new the ones
+ * we propose to set.
+ *
+ * Return: false if proposed change is OK, true if not ok and should be denied.
*/
-static inline bool map_deny_write_exec(struct vm_area_struct *vma, unsigned long vm_flags)
+static inline bool map_deny_write_exec(unsigned long old, unsigned long new)
{
+ /* If MDWE is disabled, we have nothing to deny. */
if (!test_bit(MMF_HAS_MDWE, ¤t->mm->flags))
return false;
- if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
+ /* If the new VMA is not executable, we have nothing to deny. */
+ if (!(new & VM_EXEC))
+ return false;
+
+ /* Under MDWE we do not accept newly writably executable VMAs... */
+ if (new & VM_WRITE)
return true;
- if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
+ /* ...nor previously non-executable VMAs becoming executable. */
+ if (!(old & VM_EXEC))
return true;
return false;
diff --git a/mm/mmap.c b/mm/mmap.c
index d55c58e99a54..66edf0ebba94 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1508,7 +1508,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_set_anonymous(vma);
}
- if (map_deny_write_exec(vma, vma->vm_flags)) {
+ if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
error = -EACCES;
goto close_and_free_vma;
}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0c5d6d06107d..6f450af3252e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -810,7 +810,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
break;
}
- if (map_deny_write_exec(vma, newflags)) {
+ if (map_deny_write_exec(vma->vm_flags, newflags)) {
error = -EACCES;
break;
}
diff --git a/mm/vma.h b/mm/vma.h
index 75558b5e9c8c..d58068c0ff2e 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -42,7 +42,7 @@ struct vma_munmap_struct {
int vma_count; /* Number of vmas that will be removed */
bool unlock; /* Unlock after the munmap */
bool clear_ptes; /* If there are outstanding PTE to be cleared */
- /* 1 byte hole */
+ /* 2 byte hole */
unsigned long nr_pages; /* Number of pages being removed */
unsigned long locked_vm; /* Number of locked pages */
unsigned long nr_accounted; /* Number of VM_ACCOUNT pages */
--
2.47.0
* [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
` (2 preceding siblings ...)
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
@ 2024-10-23 20:38 ` Lorenzo Stoakes
2024-10-28 18:29 ` Mark Brown
2024-10-23 20:38 ` [PATCH v2 5/8] tools: testing: add additional vma_internal.h stubs Lorenzo Stoakes
` (3 subsequent siblings)
7 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
The mmap_region() function is somewhat terrifying, with spaghetti-like
control flow and numerous means by which issues can arise, resulting in
incomplete state, memory leaks and other unpleasantness.
A large amount of the complexity arises from trying to handle errors late
in the process of mapping a VMA, which forms the basis of recently observed
issues with resource leaks and observable inconsistent state.
Taking advantage of previous patches in this series we move a number of
checks earlier in the code, simplifying things by moving the core of the
logic into a static internal function __mmap_region().
Doing this allows us to perform a number of checks up front before we do
any real work, allows us to unwind the writable mapping count
unconditionally as required, and allows us to perform
CONFIG_DEBUG_VM_MAPLE_TREE validation unconditionally also.
We move a number of things here:
1. We preallocate memory for the iterator before we call the file-backed
memory hook, allowing us to exit early and avoid having to perform
complicated and error-prone close/free logic. We carefully free
iterator state on both success and error paths.
2. The enclosing mmap_region() function handles the mapping_map_writable()
logic early. Previously the logic had the mapping_map_writable() at the
point of mapping a newly allocated file-backed VMA, and a matching
mapping_unmap_writable() on success and error paths.
We now do this unconditionally if this is a file-backed, shared writable
mapping. If a driver changes the flags to eliminate VM_MAYWRITE, doing so
does not invalidate the seal check we just performed, and in any case we
always decrement the counter in the wrapper.
We perform a debug assert to ensure a driver does not attempt to do the
opposite.
3. We also move arch_validate_flags() up into the mmap_region()
function. This is only relevant on arm64 and sparc64, and the check is
only meaningful for SPARC with ADI enabled. We explicitly add a warning
for this arch if a driver invalidates this check, though the code ought
eventually to be fixed to eliminate the need for this.
With all of these measures in place, we no longer need to explicitly close
the VMA on error paths, as we place all checks which might fail prior to a
call to any driver mmap hook.
This eliminates an entire class of errors, makes the code easier to reason
about and more robust.
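In rough outline (a simplified sketch only; the precise code is in the diff
below), the mmap_region() wrapper now performs every check which can fail
before __mmap_region() is able to invoke any driver hook, and undoes the
writable mapping count regardless of outcome:

	if (map_deny_write_exec(vm_flags, vm_flags))
		return -EACCES;
	if (!arch_validate_flags(vm_flags))
		return -EINVAL;
	if (file && is_shared_maywrite(vm_flags)) {
		int error = mapping_map_writable(file->f_mapping);

		if (error)
			return error;
		writable_file_mapping = true;
	}

	ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);

	/* Clear our write mapping regardless of error. */
	if (writable_file_mapping)
		mapping_unmap_writable(file->f_mapping);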
Reported-by: Jann Horn <jannh@google.com>
Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails")
Cc: stable <stable@kernel.org>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 119 +++++++++++++++++++++++++++++-------------------------
1 file changed, 65 insertions(+), 54 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 66edf0ebba94..e686d57ed9f7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1361,20 +1361,18 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return do_vmi_munmap(&vmi, mm, start, len, uf, false);
}
-unsigned long mmap_region(struct file *file, unsigned long addr,
+static unsigned long __mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
pgoff_t pglen = PHYS_PFN(len);
- struct vm_area_struct *merge;
unsigned long charged = 0;
struct vma_munmap_struct vms;
struct ma_state mas_detach;
struct maple_tree mt_detach;
unsigned long end = addr + len;
- bool writable_file_mapping = false;
int error;
VMA_ITERATOR(vmi, mm, addr);
VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
@@ -1448,28 +1446,26 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vm_flags_init(vma, vm_flags);
vma->vm_page_prot = vm_get_page_prot(vm_flags);
+ if (vma_iter_prealloc(&vmi, vma)) {
+ error = -ENOMEM;
+ goto free_vma;
+ }
+
if (file) {
vma->vm_file = get_file(file);
error = mmap_file(file, vma);
if (error)
- goto unmap_and_free_vma;
-
- if (vma_is_shared_maywrite(vma)) {
- error = mapping_map_writable(file->f_mapping);
- if (error)
- goto close_and_free_vma;
-
- writable_file_mapping = true;
- }
+ goto unmap_and_free_file_vma;
+ /* Drivers cannot alter the address of the VMA. */
+ WARN_ON_ONCE(addr != vma->vm_start);
/*
- * Expansion is handled above, merging is handled below.
- * Drivers should not alter the address of the VMA.
+ * Drivers should not permit writability when previously it was
+ * disallowed.
*/
- if (WARN_ON((addr != vma->vm_start))) {
- error = -EINVAL;
- goto close_and_free_vma;
- }
+ VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
+ !(vm_flags & VM_MAYWRITE) &&
+ (vma->vm_flags & VM_MAYWRITE));
vma_iter_config(&vmi, addr, end);
/*
@@ -1477,6 +1473,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* vma again as we may succeed this time.
*/
if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
+ struct vm_area_struct *merge;
+
vmg.flags = vma->vm_flags;
/* If this fails, state is reset ready for a reattempt. */
merge = vma_merge_new_range(&vmg);
@@ -1494,7 +1492,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma = merge;
/* Update vm_flags to pick up the change. */
vm_flags = vma->vm_flags;
- goto unmap_writable;
+ goto file_expanded;
}
vma_iter_config(&vmi, addr, end);
}
@@ -1503,26 +1501,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
} else if (vm_flags & VM_SHARED) {
error = shmem_zero_setup(vma);
if (error)
- goto free_vma;
+ goto free_iter_vma;
} else {
vma_set_anonymous(vma);
}
- if (map_deny_write_exec(vma->vm_flags, vma->vm_flags)) {
- error = -EACCES;
- goto close_and_free_vma;
- }
-
- /* Allow architectures to sanity-check the vm_flags */
- if (!arch_validate_flags(vma->vm_flags)) {
- error = -EINVAL;
- goto close_and_free_vma;
- }
-
- if (vma_iter_prealloc(&vmi, vma)) {
- error = -ENOMEM;
- goto close_and_free_vma;
- }
+#ifdef CONFIG_SPARC64
+ /* TODO: Fix SPARC ADI! */
+ WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+#endif
/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write(vma);
@@ -1536,10 +1523,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
*/
khugepaged_enter_vma(vma, vma->vm_flags);
- /* Once vma denies write, undo our temporary denial count */
-unmap_writable:
- if (writable_file_mapping)
- mapping_unmap_writable(file->f_mapping);
+file_expanded:
file = vma->vm_file;
ksm_add_vma(vma);
expanded:
@@ -1572,23 +1556,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_set_page_prot(vma);
- validate_mm(mm);
return addr;
-close_and_free_vma:
- vma_close(vma);
-
- if (file || vma->vm_file) {
-unmap_and_free_vma:
- fput(vma->vm_file);
- vma->vm_file = NULL;
+unmap_and_free_file_vma:
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
- vma_iter_set(&vmi, vma->vm_end);
- /* Undo any partial mapping done by a device driver. */
- unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
- }
- if (writable_file_mapping)
- mapping_unmap_writable(file->f_mapping);
+ vma_iter_set(&vmi, vma->vm_end);
+ /* Undo any partial mapping done by a device driver. */
+ unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
+free_iter_vma:
+ vma_iter_free(&vmi);
free_vma:
vm_area_free(vma);
unacct_error:
@@ -1598,10 +1576,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
abort_munmap:
vms_abort_munmap_vmas(&vms, &mas_detach);
gather_failed:
- validate_mm(mm);
return error;
}
+unsigned long mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+ struct list_head *uf)
+{
+ unsigned long ret;
+ bool writable_file_mapping = false;
+
+ /* Check to see if MDWE is applicable. */
+ if (map_deny_write_exec(vm_flags, vm_flags))
+ return -EACCES;
+
+ /* Allow architectures to sanity-check the vm_flags. */
+ if (!arch_validate_flags(vm_flags))
+ return -EINVAL;
+
+ /* Map writable and ensure this isn't a sealed memfd. */
+ if (file && is_shared_maywrite(vm_flags)) {
+ int error = mapping_map_writable(file->f_mapping);
+
+ if (error)
+ return error;
+ writable_file_mapping = true;
+ }
+
+ ret = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
+
+ /* Clear our write mapping regardless of error. */
+ if (writable_file_mapping)
+ mapping_unmap_writable(file->f_mapping);
+
+ validate_mm(current->mm);
+ return ret;
+}
+
static int __vm_munmap(unsigned long start, size_t len, bool unlock)
{
int ret;
--
2.47.0
* [PATCH v2 5/8] tools: testing: add additional vma_internal.h stubs
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
` (3 preceding siblings ...)
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
@ 2024-10-23 20:38 ` Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH v2 6/8] mm: isolate mmap internal logic to mm/vma.c Lorenzo Stoakes
` (2 subsequent siblings)
7 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
Add some additional vma_internal.h stubs in preparation for __mmap_region()
being moved to mm/vma.c. Without these the move would result in the tests
no longer compiling.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
tools/testing/vma/vma_internal.h | 115 ++++++++++++++++++++++++++++++-
1 file changed, 114 insertions(+), 1 deletion(-)
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index c5b9da034511..77eba27365a2 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -44,7 +44,9 @@
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000
#define VM_DONTEXPAND 0x00040000
+#define VM_LOCKONFAULT 0x00080000
#define VM_ACCOUNT 0x00100000
+#define VM_NORESERVE 0x00200000
#define VM_MIXEDMAP 0x10000000
#define VM_STACK VM_GROWSDOWN
#define VM_SHADOW_STACK VM_NONE
@@ -53,6 +55,14 @@
#define VM_ACCESS_FLAGS (VM_READ | VM_WRITE | VM_EXEC)
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)
+/* This mask represents all the VMA flag bits used by mlock */
+#define VM_LOCKED_MASK (VM_LOCKED | VM_LOCKONFAULT)
+
+#ifdef CONFIG_64BIT
+/* VM is sealed, in vm_flags */
+#define VM_SEALED _BITUL(63)
+#endif
+
#define FIRST_USER_ADDRESS 0UL
#define USER_PGTABLES_CEILING 0UL
@@ -698,8 +708,9 @@ static inline void tlb_finish_mmu(struct mmu_gather *)
{
}
-static inline void get_file(struct file *)
+static inline struct file *get_file(struct file *f)
{
+ return f;
}
static inline int vma_dup_policy(struct vm_area_struct *, struct vm_area_struct *)
@@ -920,4 +931,106 @@ static inline bool signal_pending(void *)
return false;
}
+static inline bool is_file_hugepages(struct file *)
+{
+ return false;
+}
+
+static inline int security_vm_enough_memory_mm(struct mm_struct *, long)
+{
+ return true;
+}
+
+static inline bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long)
+{
+ return true;
+}
+
+static inline void vm_flags_init(struct vm_area_struct *vma,
+ vm_flags_t flags)
+{
+ vma->__vm_flags = flags;
+}
+
+static inline void vm_flags_set(struct vm_area_struct *vma,
+ vm_flags_t flags)
+{
+ vma_start_write(vma);
+ vma->__vm_flags |= flags;
+}
+
+static inline void vm_flags_clear(struct vm_area_struct *vma,
+ vm_flags_t flags)
+{
+ vma_start_write(vma);
+ vma->__vm_flags &= ~flags;
+}
+
+static inline int call_mmap(struct file *, struct vm_area_struct *)
+{
+ return 0;
+}
+
+static inline int shmem_zero_setup(struct vm_area_struct *)
+{
+ return 0;
+}
+
+static inline void vma_set_anonymous(struct vm_area_struct *vma)
+{
+ vma->vm_ops = NULL;
+}
+
+static inline void ksm_add_vma(struct vm_area_struct *)
+{
+}
+
+static inline void perf_event_mmap(struct vm_area_struct *)
+{
+}
+
+static inline bool vma_is_dax(struct vm_area_struct *)
+{
+ return false;
+}
+
+static inline struct vm_area_struct *get_gate_vma(struct mm_struct *)
+{
+ return NULL;
+}
+
+bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
+
+/* Update vma->vm_page_prot to reflect vma->vm_flags. */
+static inline void vma_set_page_prot(struct vm_area_struct *vma)
+{
+ unsigned long vm_flags = vma->vm_flags;
+ pgprot_t vm_page_prot;
+
+ /* testing: we inline vm_pgprot_modify() to avoid clash with vma.h. */
+ vm_page_prot = pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vm_flags));
+
+ if (vma_wants_writenotify(vma, vm_page_prot)) {
+ vm_flags &= ~VM_SHARED;
+ /* testing: we inline vm_pgprot_modify() to avoid clash with vma.h. */
+ vm_page_prot = pgprot_modify(vm_page_prot, vm_get_page_prot(vm_flags));
+ }
+ /* remove_protection_ptes reads vma->vm_page_prot without mmap_lock */
+ WRITE_ONCE(vma->vm_page_prot, vm_page_prot);
+}
+
+static inline bool arch_validate_flags(unsigned long)
+{
+ return true;
+}
+
+static inline void vma_close(struct vm_area_struct *)
+{
+}
+
+static inline int mmap_file(struct file *, struct vm_area_struct *)
+{
+ return 0;
+}
+
#endif /* __MM_VMA_INTERNAL_H */
--
2.47.0
* [PATCH v2 6/8] mm: isolate mmap internal logic to mm/vma.c
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
` (4 preceding siblings ...)
2024-10-23 20:38 ` [PATCH v2 5/8] tools: testing: add additional vma_internal.h stubs Lorenzo Stoakes
@ 2024-10-23 20:38 ` Lorenzo Stoakes
2024-10-24 17:23 ` Vlastimil Babka
2024-10-23 20:38 ` [PATCH v2 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH v2 8/8] mm: defer second attempt at merge on mmap() Lorenzo Stoakes
7 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
In previous commits we effected improvements to the mmap() logic in
mmap_region() and its newly introduced internal implementation function
__mmap_region().
However, as these changes are intended to be backported, we kept the delta
as small as possible and made as few changes as possible to the newly
introduced mm/vma.* files.
Take the opportunity to move this logic to mm/vma.c, which not only isolates
it but also makes it available for later userland testing, which can help us
catch such logic errors far earlier.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 234 ---------------------------------
mm/vma.c | 323 +++++++++++++++++++++++++++++++++++++++++++++-
mm/vma.h | 97 +-------------
mm/vma_internal.h | 5 +
4 files changed, 329 insertions(+), 330 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index e686d57ed9f7..0affd1a0687f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -577,22 +577,6 @@ SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
}
#endif /* __ARCH_WANT_SYS_OLD_MMAP */
-/*
- * We account for memory if it's a private writeable mapping,
- * not hugepages and VM_NORESERVE wasn't set.
- */
-static inline bool accountable_mapping(struct file *file, vm_flags_t vm_flags)
-{
- /*
- * hugetlb has its own accounting separate from the core VM
- * VM_HUGETLB may not be set yet so we cannot check for that flag.
- */
- if (file && is_file_hugepages(file))
- return false;
-
- return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
-}
-
/**
* unmapped_area() - Find an area between the low_limit and the high_limit with
* the correct alignment and offset, all from @info. Note: current->mm is used
@@ -1361,224 +1345,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return do_vmi_munmap(&vmi, mm, start, len, uf, false);
}
-static unsigned long __mmap_region(struct file *file, unsigned long addr,
- unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf)
-{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma = NULL;
- pgoff_t pglen = PHYS_PFN(len);
- unsigned long charged = 0;
- struct vma_munmap_struct vms;
- struct ma_state mas_detach;
- struct maple_tree mt_detach;
- unsigned long end = addr + len;
- int error;
- VMA_ITERATOR(vmi, mm, addr);
- VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
-
- vmg.file = file;
- /* Find the first overlapping VMA */
- vma = vma_find(&vmi, end);
- init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
- if (vma) {
- mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
- mt_on_stack(mt_detach);
- mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
- /* Prepare to unmap any existing mapping in the area */
- error = vms_gather_munmap_vmas(&vms, &mas_detach);
- if (error)
- goto gather_failed;
-
- vmg.next = vms.next;
- vmg.prev = vms.prev;
- vma = NULL;
- } else {
- vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
- }
-
- /* Check against address space limit. */
- if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
- error = -ENOMEM;
- goto abort_munmap;
- }
-
- /*
- * Private writable mapping: check memory availability
- */
- if (accountable_mapping(file, vm_flags)) {
- charged = pglen;
- charged -= vms.nr_accounted;
- if (charged) {
- error = security_vm_enough_memory_mm(mm, charged);
- if (error)
- goto abort_munmap;
- }
-
- vms.nr_accounted = 0;
- vm_flags |= VM_ACCOUNT;
- vmg.flags = vm_flags;
- }
-
- /*
- * clear PTEs while the vma is still in the tree so that rmap
- * cannot race with the freeing later in the truncate scenario.
- * This is also needed for mmap_file(), which is why vm_ops
- * close function is called.
- */
- vms_clean_up_area(&vms, &mas_detach);
- vma = vma_merge_new_range(&vmg);
- if (vma)
- goto expanded;
- /*
- * Determine the object being mapped and call the appropriate
- * specific mapper. the address has already been validated, but
- * not unmapped, but the maps are removed from the list.
- */
- vma = vm_area_alloc(mm);
- if (!vma) {
- error = -ENOMEM;
- goto unacct_error;
- }
-
- vma_iter_config(&vmi, addr, end);
- vma_set_range(vma, addr, end, pgoff);
- vm_flags_init(vma, vm_flags);
- vma->vm_page_prot = vm_get_page_prot(vm_flags);
-
- if (vma_iter_prealloc(&vmi, vma)) {
- error = -ENOMEM;
- goto free_vma;
- }
-
- if (file) {
- vma->vm_file = get_file(file);
- error = mmap_file(file, vma);
- if (error)
- goto unmap_and_free_file_vma;
-
- /* Drivers cannot alter the address of the VMA. */
- WARN_ON_ONCE(addr != vma->vm_start);
- /*
- * Drivers should not permit writability when previously it was
- * disallowed.
- */
- VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
- !(vm_flags & VM_MAYWRITE) &&
- (vma->vm_flags & VM_MAYWRITE));
-
- vma_iter_config(&vmi, addr, end);
- /*
- * If vm_flags changed after mmap_file(), we should try merge
- * vma again as we may succeed this time.
- */
- if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
- struct vm_area_struct *merge;
-
- vmg.flags = vma->vm_flags;
- /* If this fails, state is reset ready for a reattempt. */
- merge = vma_merge_new_range(&vmg);
-
- if (merge) {
- /*
- * ->mmap() can change vma->vm_file and fput
- * the original file. So fput the vma->vm_file
- * here or we would add an extra fput for file
- * and cause general protection fault
- * ultimately.
- */
- fput(vma->vm_file);
- vm_area_free(vma);
- vma = merge;
- /* Update vm_flags to pick up the change. */
- vm_flags = vma->vm_flags;
- goto file_expanded;
- }
- vma_iter_config(&vmi, addr, end);
- }
-
- vm_flags = vma->vm_flags;
- } else if (vm_flags & VM_SHARED) {
- error = shmem_zero_setup(vma);
- if (error)
- goto free_iter_vma;
- } else {
- vma_set_anonymous(vma);
- }
-
-#ifdef CONFIG_SPARC64
- /* TODO: Fix SPARC ADI! */
- WARN_ON_ONCE(!arch_validate_flags(vm_flags));
-#endif
-
- /* Lock the VMA since it is modified after insertion into VMA tree */
- vma_start_write(vma);
- vma_iter_store(&vmi, vma);
- mm->map_count++;
- vma_link_file(vma);
-
- /*
- * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
- * call covers the non-merge case.
- */
- khugepaged_enter_vma(vma, vma->vm_flags);
-
-file_expanded:
- file = vma->vm_file;
- ksm_add_vma(vma);
-expanded:
- perf_event_mmap(vma);
-
- /* Unmap any existing mapping in the area */
- vms_complete_munmap_vmas(&vms, &mas_detach);
-
- vm_stat_account(mm, vm_flags, pglen);
- if (vm_flags & VM_LOCKED) {
- if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
- is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current->mm))
- vm_flags_clear(vma, VM_LOCKED_MASK);
- else
- mm->locked_vm += pglen;
- }
-
- if (file)
- uprobe_mmap(vma);
-
- /*
- * New (or expanded) vma always get soft dirty status.
- * Otherwise user-space soft-dirty page tracker won't
- * be able to distinguish situation when vma area unmapped,
- * then new mapped in-place (which must be aimed as
- * a completely new data area).
- */
- vm_flags_set(vma, VM_SOFTDIRTY);
-
- vma_set_page_prot(vma);
-
- return addr;
-
-unmap_and_free_file_vma:
- fput(vma->vm_file);
- vma->vm_file = NULL;
-
- vma_iter_set(&vmi, vma->vm_end);
- /* Undo any partial mapping done by a device driver. */
- unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
-free_iter_vma:
- vma_iter_free(&vmi);
-free_vma:
- vm_area_free(vma);
-unacct_error:
- if (charged)
- vm_unacct_memory(charged);
-
-abort_munmap:
- vms_abort_munmap_vmas(&vms, &mas_detach);
-gather_failed:
- return error;
-}
-
unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
diff --git a/mm/vma.c b/mm/vma.c
index bb7cfa2dc282..0a2965be582d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1103,7 +1103,7 @@ static inline void vms_clear_ptes(struct vma_munmap_struct *vms,
vms->clear_ptes = false;
}
-void vms_clean_up_area(struct vma_munmap_struct *vms,
+static void vms_clean_up_area(struct vma_munmap_struct *vms,
struct ma_state *mas_detach)
{
struct vm_area_struct *vma;
@@ -1126,7 +1126,7 @@ void vms_clean_up_area(struct vma_munmap_struct *vms,
* used for the munmap() and may downgrade the lock - if requested. Everything
* needed to be done once the vma maple tree is updated.
*/
-void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
+static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach)
{
struct vm_area_struct *vma;
@@ -1167,6 +1167,23 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
__mt_destroy(mas_detach->tree);
}
+/*
+ * reattach_vmas() - Undo any munmap work and free resources
+ * @mas_detach: The maple state with the detached maple tree
+ *
+ * Reattach any detached vmas and free up the maple tree used to track the vmas.
+ */
+static void reattach_vmas(struct ma_state *mas_detach)
+{
+ struct vm_area_struct *vma;
+
+ mas_set(mas_detach, 0);
+ mas_for_each(mas_detach, vma, ULONG_MAX)
+ vma_mark_detached(vma, false);
+
+ __mt_destroy(mas_detach->tree);
+}
+
/*
* vms_gather_munmap_vmas() - Put all VMAs within a range into a maple tree
* for removal at a later date. Handles splitting first and last if necessary
@@ -1177,7 +1194,7 @@ void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
*
* Return: 0 on success, error otherwise
*/
-int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
+static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
struct ma_state *mas_detach)
{
struct vm_area_struct *next = NULL;
@@ -1315,6 +1332,39 @@ int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
return error;
}
+/*
+ * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
+ * @vms: The vma munmap struct
+ * @vmi: The vma iterator
+ * @vma: The first vm_area_struct to munmap
+ * @start: The aligned start address to munmap
+ * @end: The aligned end address to munmap
+ * @uf: The userfaultfd list_head
+ * @unlock: Unlock after the operation. Only unlocked on success
+ */
+static void init_vma_munmap(struct vma_munmap_struct *vms,
+ struct vma_iterator *vmi, struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, struct list_head *uf,
+ bool unlock)
+{
+ vms->vmi = vmi;
+ vms->vma = vma;
+ if (vma) {
+ vms->start = start;
+ vms->end = end;
+ } else {
+ vms->start = vms->end = 0;
+ }
+ vms->unlock = unlock;
+ vms->uf = uf;
+ vms->vma_count = 0;
+ vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
+ vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
+ vms->unmap_start = FIRST_USER_ADDRESS;
+ vms->unmap_end = USER_PGTABLES_CEILING;
+ vms->clear_ptes = false;
+}
+
/*
* do_vmi_align_munmap() - munmap the aligned region from @start to @end.
* @vmi: The vma iterator
@@ -2069,3 +2119,270 @@ void mm_drop_all_locks(struct mm_struct *mm)
mutex_unlock(&mm_all_locks_mutex);
}
+
+/*
+ * We account for memory if it's a private writeable mapping,
+ * not hugepages and VM_NORESERVE wasn't set.
+ */
+static bool accountable_mapping(struct file *file, vm_flags_t vm_flags)
+{
+ /*
+ * hugetlb has its own accounting separate from the core VM
+ * VM_HUGETLB may not be set yet so we cannot check for that flag.
+ */
+ if (file && is_file_hugepages(file))
+ return false;
+
+ return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
+}
+
+/*
+ * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
+ * operation.
+ * @vms: The vma unmap structure
+ * @mas_detach: The maple state with the detached maple tree
+ *
+ * Reattach any detached vmas, free up the maple tree used to track the vmas.
+ * If that's not possible because the ptes are cleared (and vm_ops->closed() may
+ * have been called), then a NULL is written over the vmas and the vmas are
+ * removed (munmap() completed).
+ */
+static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
+ struct ma_state *mas_detach)
+{
+ struct ma_state *mas = &vms->vmi->mas;
+
+ if (!vms->nr_pages)
+ return;
+
+ if (vms->clear_ptes)
+ return reattach_vmas(mas_detach);
+
+ /*
+ * Aborting cannot just call the vm_ops open() because they are often
+ * not symmetrical and state data has been lost. Resort to the old
+ * failure method of leaving a gap where the MAP_FIXED mapping failed.
+ */
+ mas_set_range(mas, vms->start, vms->end - 1);
+ mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
+ /* Clean up the insertion of the unfortunate gap */
+ vms_complete_munmap_vmas(vms, mas_detach);
+}
+
+unsigned long __mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+ struct list_head *uf)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma = NULL;
+ pgoff_t pglen = PHYS_PFN(len);
+ unsigned long charged = 0;
+ struct vma_munmap_struct vms;
+ struct ma_state mas_detach;
+ struct maple_tree mt_detach;
+ unsigned long end = addr + len;
+ int error;
+ VMA_ITERATOR(vmi, mm, addr);
+ VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
+
+ vmg.file = file;
+ /* Find the first overlapping VMA */
+ vma = vma_find(&vmi, end);
+ init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
+ if (vma) {
+ mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+ mt_on_stack(mt_detach);
+ mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+ /* Prepare to unmap any existing mapping in the area */
+ error = vms_gather_munmap_vmas(&vms, &mas_detach);
+ if (error)
+ goto gather_failed;
+
+ vmg.next = vms.next;
+ vmg.prev = vms.prev;
+ vma = NULL;
+ } else {
+ vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
+ }
+
+ /* Check against address space limit. */
+ if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
+ error = -ENOMEM;
+ goto abort_munmap;
+ }
+
+ /*
+ * Private writable mapping: check memory availability
+ */
+ if (accountable_mapping(file, vm_flags)) {
+ charged = pglen;
+ charged -= vms.nr_accounted;
+ if (charged) {
+ error = security_vm_enough_memory_mm(mm, charged);
+ if (error)
+ goto abort_munmap;
+ }
+
+ vms.nr_accounted = 0;
+ vm_flags |= VM_ACCOUNT;
+ vmg.flags = vm_flags;
+ }
+
+ /*
+ * clear PTEs while the vma is still in the tree so that rmap
+ * cannot race with the freeing later in the truncate scenario.
+ * This is also needed for mmap_file(), which is why vm_ops
+ * close function is called.
+ */
+ vms_clean_up_area(&vms, &mas_detach);
+ vma = vma_merge_new_range(&vmg);
+ if (vma)
+ goto expanded;
+ /*
+ * Determine the object being mapped and call the appropriate
+ * specific mapper. the address has already been validated, but
+ * not unmapped, but the maps are removed from the list.
+ */
+ vma = vm_area_alloc(mm);
+ if (!vma) {
+ error = -ENOMEM;
+ goto unacct_error;
+ }
+
+ vma_iter_config(&vmi, addr, end);
+ vma_set_range(vma, addr, end, pgoff);
+ vm_flags_init(vma, vm_flags);
+ vma->vm_page_prot = vm_get_page_prot(vm_flags);
+
+ if (vma_iter_prealloc(&vmi, vma)) {
+ error = -ENOMEM;
+ goto free_vma;
+ }
+
+ if (file) {
+ vma->vm_file = get_file(file);
+ error = mmap_file(file, vma);
+ if (error)
+ goto unmap_and_free_file_vma;
+
+ /* Drivers cannot alter the address of the VMA. */
+ WARN_ON_ONCE(addr != vma->vm_start);
+ /*
+ * Drivers should not permit writability when previously it was
+ * disallowed.
+ */
+ VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
+ !(vm_flags & VM_MAYWRITE) &&
+ (vma->vm_flags & VM_MAYWRITE));
+
+ vma_iter_config(&vmi, addr, end);
+ /*
+ * If vm_flags changed after mmap_file(), we should try merge
+ * vma again as we may succeed this time.
+ */
+ if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
+ struct vm_area_struct *merge;
+
+ vmg.flags = vma->vm_flags;
+ /* If this fails, state is reset ready for a reattempt. */
+ merge = vma_merge_new_range(&vmg);
+
+ if (merge) {
+ /*
+ * ->mmap() can change vma->vm_file and fput
+ * the original file. So fput the vma->vm_file
+ * here or we would add an extra fput for file
+ * and cause general protection fault
+ * ultimately.
+ */
+ fput(vma->vm_file);
+ vm_area_free(vma);
+ vma = merge;
+ /* Update vm_flags to pick up the change. */
+ vm_flags = vma->vm_flags;
+ goto file_expanded;
+ }
+ vma_iter_config(&vmi, addr, end);
+ }
+
+ vm_flags = vma->vm_flags;
+ } else if (vm_flags & VM_SHARED) {
+ error = shmem_zero_setup(vma);
+ if (error)
+ goto free_iter_vma;
+ } else {
+ vma_set_anonymous(vma);
+ }
+
+#ifdef CONFIG_SPARC64
+ /* TODO: Fix SPARC ADI! */
+ WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+#endif
+
+ /* Lock the VMA since it is modified after insertion into VMA tree */
+ vma_start_write(vma);
+ vma_iter_store(&vmi, vma);
+ mm->map_count++;
+ vma_link_file(vma);
+
+ /*
+ * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
+ * call covers the non-merge case.
+ */
+ khugepaged_enter_vma(vma, vma->vm_flags);
+
+file_expanded:
+ file = vma->vm_file;
+ ksm_add_vma(vma);
+expanded:
+ perf_event_mmap(vma);
+
+ /* Unmap any existing mapping in the area */
+ vms_complete_munmap_vmas(&vms, &mas_detach);
+
+ vm_stat_account(mm, vm_flags, pglen);
+ if (vm_flags & VM_LOCKED) {
+ if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
+ is_vm_hugetlb_page(vma) ||
+ vma == get_gate_vma(current->mm))
+ vm_flags_clear(vma, VM_LOCKED_MASK);
+ else
+ mm->locked_vm += pglen;
+ }
+
+ if (file)
+ uprobe_mmap(vma);
+
+ /*
+ * New (or expanded) vma always get soft dirty status.
+ * Otherwise user-space soft-dirty page tracker won't
+ * be able to distinguish situation when vma area unmapped,
+ * then new mapped in-place (which must be aimed as
+ * a completely new data area).
+ */
+ vm_flags_set(vma, VM_SOFTDIRTY);
+
+ vma_set_page_prot(vma);
+
+ return addr;
+
+unmap_and_free_file_vma:
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+
+ vma_iter_set(&vmi, vma->vm_end);
+ /* Undo any partial mapping done by a device driver. */
+ unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
+free_iter_vma:
+ vma_iter_free(&vmi);
+free_vma:
+ vm_area_free(vma);
+unacct_error:
+ if (charged)
+ vm_unacct_memory(charged);
+
+abort_munmap:
+ vms_abort_munmap_vmas(&vms, &mas_detach);
+gather_failed:
+ return error;
+}
diff --git a/mm/vma.h b/mm/vma.h
index d58068c0ff2e..388d34748674 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -165,99 +165,6 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
return 0;
}
-#ifdef CONFIG_MMU
-/*
- * init_vma_munmap() - Initializer wrapper for vma_munmap_struct
- * @vms: The vma munmap struct
- * @vmi: The vma iterator
- * @vma: The first vm_area_struct to munmap
- * @start: The aligned start address to munmap
- * @end: The aligned end address to munmap
- * @uf: The userfaultfd list_head
- * @unlock: Unlock after the operation. Only unlocked on success
- */
-static inline void init_vma_munmap(struct vma_munmap_struct *vms,
- struct vma_iterator *vmi, struct vm_area_struct *vma,
- unsigned long start, unsigned long end, struct list_head *uf,
- bool unlock)
-{
- vms->vmi = vmi;
- vms->vma = vma;
- if (vma) {
- vms->start = start;
- vms->end = end;
- } else {
- vms->start = vms->end = 0;
- }
- vms->unlock = unlock;
- vms->uf = uf;
- vms->vma_count = 0;
- vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
- vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
- vms->unmap_start = FIRST_USER_ADDRESS;
- vms->unmap_end = USER_PGTABLES_CEILING;
- vms->clear_ptes = false;
-}
-#endif
-
-int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
- struct ma_state *mas_detach);
-
-void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
- struct ma_state *mas_detach);
-
-void vms_clean_up_area(struct vma_munmap_struct *vms,
- struct ma_state *mas_detach);
-
-/*
- * reattach_vmas() - Undo any munmap work and free resources
- * @mas_detach: The maple state with the detached maple tree
- *
- * Reattach any detached vmas and free up the maple tree used to track the vmas.
- */
-static inline void reattach_vmas(struct ma_state *mas_detach)
-{
- struct vm_area_struct *vma;
-
- mas_set(mas_detach, 0);
- mas_for_each(mas_detach, vma, ULONG_MAX)
- vma_mark_detached(vma, false);
-
- __mt_destroy(mas_detach->tree);
-}
-
-/*
- * vms_abort_munmap_vmas() - Undo as much as possible from an aborted munmap()
- * operation.
- * @vms: The vma unmap structure
- * @mas_detach: The maple state with the detached maple tree
- *
- * Reattach any detached vmas, free up the maple tree used to track the vmas.
- * If that's not possible because the ptes are cleared (and vm_ops->closed() may
- * have been called), then a NULL is written over the vmas and the vmas are
- * removed (munmap() completed).
- */
-static inline void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
- struct ma_state *mas_detach)
-{
- struct ma_state *mas = &vms->vmi->mas;
- if (!vms->nr_pages)
- return;
-
- if (vms->clear_ptes)
- return reattach_vmas(mas_detach);
-
- /*
- * Aborting cannot just call the vm_ops open() because they are often
- * not symmetrical and state data has been lost. Resort to the old
- * failure method of leaving a gap where the MAP_FIXED mapping failed.
- */
- mas_set_range(mas, vms->start, vms->end - 1);
- mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
- /* Clean up the insertion of the unfortunate gap */
- vms_complete_munmap_vmas(vms, mas_detach);
-}
-
int
do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
struct mm_struct *mm, unsigned long start,
@@ -336,6 +243,10 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
int mm_take_all_locks(struct mm_struct *mm);
void mm_drop_all_locks(struct mm_struct *mm);
+unsigned long __mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+ struct list_head *uf);
+
static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
{
/*
diff --git a/mm/vma_internal.h b/mm/vma_internal.h
index b930ab12a587..fc5f172a36bd 100644
--- a/mm/vma_internal.h
+++ b/mm/vma_internal.h
@@ -17,8 +17,10 @@
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/huge_mm.h>
+#include <linux/hugetlb.h>
#include <linux/hugetlb_inline.h>
#include <linux/kernel.h>
+#include <linux/ksm.h>
#include <linux/khugepaged.h>
#include <linux/list.h>
#include <linux/maple_tree.h>
@@ -32,11 +34,14 @@
#include <linux/mmu_context.h>
#include <linux/mutex.h>
#include <linux/pagemap.h>
+#include <linux/perf_event.h>
#include <linux/pfn.h>
#include <linux/rcupdate.h>
#include <linux/rmap.h>
#include <linux/rwsem.h>
#include <linux/sched/signal.h>
+#include <linux/security.h>
+#include <linux/shmem_fs.h>
#include <linux/swap.h>
#include <linux/uprobes.h>
#include <linux/userfaultfd_k.h>
--
2.47.0
* [PATCH v2 7/8] mm: refactor __mmap_region()
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
` (5 preceding siblings ...)
2024-10-23 20:38 ` [PATCH v2 6/8] mm: isolate mmap internal logic to mm/vma.c Lorenzo Stoakes
@ 2024-10-23 20:38 ` Lorenzo Stoakes
2024-10-25 8:35 ` Vlastimil Babka
2024-10-23 20:38 ` [PATCH v2 8/8] mm: defer second attempt at merge on mmap() Lorenzo Stoakes
7 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
We have seen bugs and resource leaks arise from the complexity of the
__mmap_region() function. This, together with the deeply fragile error
handling logic and the complexity that makes the function difficult to
understand, makes it highly desirable to refactor it into something
readable.
Achieve this by separating the function into smaller logical parts which
are easier to understand and follow, and which importantly very
significantly simplify the error handling.
Note that we now call vms_abort_munmap_vmas() in more error paths than we
used to, however in cases where no abort need occur, vms->nr_pages will be
equal to zero and we simply exit this function without doing more than we
would have done previously.
Importantly, the invocation of the driver mmap hook via mmap_file() now has
very simple and obvious handling (this was previously the most problematic
part of the mmap() operation).
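In outline, the resulting top-level flow of __mmap_region() reads roughly as
follows (error unwinding elided, see the diff below for the detail):

	error = __mmap_prepare(&map, &vmg, uf);
	if (error)
		goto abort_munmap;

	/* Attempt to merge with adjacent VMAs... */
	vmg.flags = map.flags;
	vma = vma_merge_new_range(&vmg);
	if (!vma) {
		/* ...but if we can't, allocate a new VMA. */
		error = __mmap_new_vma(&map, &vmg, &vma);
		if (error)
			goto unacct_error;
	}

	__mmap_complete(&map, vma);
	return addr;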
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma.c | 401 ++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 261 insertions(+), 140 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index 0a2965be582d..065f5e1f65be 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -7,6 +7,40 @@
#include "vma_internal.h"
#include "vma.h"
+struct mmap_state {
+ struct mm_struct *mm;
+ struct vma_iterator *vmi;
+
+ unsigned long addr;
+ unsigned long end;
+ pgoff_t pgoff;
+ unsigned long pglen;
+ unsigned long flags;
+ struct file *file;
+
+ unsigned long charged;
+
+ struct vm_area_struct *prev;
+ struct vm_area_struct *next;
+
+ /* Unmapping state. */
+ struct vma_munmap_struct vms;
+ struct ma_state mas_detach;
+ struct maple_tree mt_detach;
+};
+
+#define MMAP_STATE(name, mm_, vmi_, addr_, len_, pgoff_, flags_, file_) \
+ struct mmap_state name = { \
+ .mm = mm_, \
+ .vmi = vmi_, \
+ .addr = addr_, \
+ .end = addr + len, \
+ .pgoff = pgoff_, \
+ .pglen = PHYS_PFN(len_), \
+ .flags = flags_, \
+ .file = file_, \
+ }
+
static inline bool is_mergeable_vma(struct vma_merge_struct *vmg, bool merge_next)
{
struct vm_area_struct *vma = merge_next ? vmg->next : vmg->prev;
@@ -2169,188 +2203,259 @@ static void vms_abort_munmap_vmas(struct vma_munmap_struct *vms,
vms_complete_munmap_vmas(vms, mas_detach);
}
-unsigned long __mmap_region(struct file *file, unsigned long addr,
- unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf)
+/*
+ * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
+ * unmapped once the map operation is completed, check limits, account mapping
+ * and clean up any pre-existing VMAs.
+ *
+ * @map: Mapping state.
+ * @vmg: VMA merge state.
+ * @uf: Userfaultfd context list.
+ *
+ * Returns: 0 on success, error code otherwise.
+ */
+static int __mmap_prepare(struct mmap_state *map, struct vma_merge_struct *vmg,
+ struct list_head *uf)
{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma = NULL;
- pgoff_t pglen = PHYS_PFN(len);
- unsigned long charged = 0;
- struct vma_munmap_struct vms;
- struct ma_state mas_detach;
- struct maple_tree mt_detach;
- unsigned long end = addr + len;
int error;
- VMA_ITERATOR(vmi, mm, addr);
- VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
-
- vmg.file = file;
- /* Find the first overlapping VMA */
- vma = vma_find(&vmi, end);
- init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
- if (vma) {
- mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
- mt_on_stack(mt_detach);
- mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
+ struct vma_iterator *vmi = map->vmi;
+ struct vma_munmap_struct *vms = &map->vms;
+
+ /* Find the first overlapping VMA and initialise unmap state. */
+ vms->vma = vma_find(vmi, map->end);
+ init_vma_munmap(vms, vmi, vms->vma, map->addr, map->end, uf,
+ /* unlock = */ false);
+
+ /* OK, we have overlapping VMAs - prepare to unmap them. */
+ if (vms->vma) {
+ mt_init_flags(&map->mt_detach,
+ vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
+ mt_on_stack(map->mt_detach);
+ mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
/* Prepare to unmap any existing mapping in the area */
- error = vms_gather_munmap_vmas(&vms, &mas_detach);
- if (error)
- goto gather_failed;
+ error = vms_gather_munmap_vmas(vms, &map->mas_detach);
+ if (error) {
+ /* On error VMAs will already have been reattached. */
+ vms->nr_pages = 0;
+ return error;
+ }
- vmg.next = vms.next;
- vmg.prev = vms.prev;
- vma = NULL;
+ map->next = vms->next;
+ map->prev = vms->prev;
} else {
- vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
+ map->next = vma_iter_next_rewind(vmi, &map->prev);
}
+ /* Set up vmg for merge attempt. */
+ vmg->file = map->file;
+ vmg->prev = map->prev;
+ vmg->next = map->next;
+
/* Check against address space limit. */
- if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
- error = -ENOMEM;
- goto abort_munmap;
- }
+ if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
+ return -ENOMEM;
- /*
- * Private writable mapping: check memory availability
- */
- if (accountable_mapping(file, vm_flags)) {
- charged = pglen;
- charged -= vms.nr_accounted;
- if (charged) {
- error = security_vm_enough_memory_mm(mm, charged);
+ /* Private writable mapping: check memory availability. */
+ if (accountable_mapping(map->file, map->flags)) {
+ map->charged = map->pglen;
+ map->charged -= vms->nr_accounted;
+ if (map->charged) {
+ error = security_vm_enough_memory_mm(map->mm, map->charged);
if (error)
- goto abort_munmap;
+ return error;
}
- vms.nr_accounted = 0;
- vm_flags |= VM_ACCOUNT;
- vmg.flags = vm_flags;
+ vms->nr_accounted = 0;
+ map->flags |= VM_ACCOUNT;
}
/*
- * clear PTEs while the vma is still in the tree so that rmap
+ * Clear PTEs while the vma is still in the tree so that rmap
* cannot race with the freeing later in the truncate scenario.
* This is also needed for mmap_file(), which is why vm_ops
* close function is called.
*/
- vms_clean_up_area(&vms, &mas_detach);
- vma = vma_merge_new_range(&vmg);
- if (vma)
- goto expanded;
+ vms_clean_up_area(vms, &map->mas_detach);
+
+ return 0;
+}
+
+static int __mmap_new_file_vma(struct mmap_state *map,
+ struct vma_merge_struct *vmg,
+ struct vm_area_struct **vmap, bool *mergedp)
+{
+ struct vma_iterator *vmi = map->vmi;
+ struct vm_area_struct *vma = *vmap;
+ int error;
+
+ vma->vm_file = get_file(map->file);
+ error = mmap_file(vma->vm_file, vma);
+ if (error) {
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+
+ vma_iter_set(vmi, vma->vm_end);
+ /* Undo any partial mapping done by a device driver. */
+ unmap_region(&vmi->mas, vma, map->prev, map->next);
+
+ return error;
+ }
+
+ /* Drivers cannot alter the address of the VMA. */
+ WARN_ON_ONCE(map->addr != vma->vm_start);
+ /*
+ * Drivers should not permit writability when previously it was
+ * disallowed.
+ */
+ VM_WARN_ON_ONCE(map->flags != vma->vm_flags &&
+ !(map->flags & VM_MAYWRITE) &&
+ (vma->vm_flags & VM_MAYWRITE));
+
+ vma_iter_config(vmi, map->addr, map->end);
+ /*
+ * If flags changed after mmap_file(), we should try merge
+ * vma again as we may succeed this time.
+ */
+ if (unlikely(map->flags != vma->vm_flags && map->prev)) {
+ struct vm_area_struct *merge;
+
+ vmg->flags = vma->vm_flags;
+ /* If this fails, state is reset ready for a reattempt. */
+ merge = vma_merge_new_range(vmg);
+
+ if (merge) {
+ /*
+ * ->mmap() can change vma->vm_file and fput
+ * the original file. So fput the vma->vm_file
+ * here or we would add an extra fput for file
+ * and cause general protection fault
+ * ultimately.
+ */
+ fput(vma->vm_file);
+ vm_area_free(vma);
+ vma = merge;
+ *mergedp = true;
+ } else {
+ vma_iter_config(vmi, map->addr, map->end);
+ }
+ }
+
+ map->flags = vma->vm_flags;
+ *vmap = vma;
+ return 0;
+}
+
+/*
+ * __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
+ * possible.
+ *
+ * An exception to this is if the mapping is file-backed, and the underlying
+ * driver changes the VMA flags, permitting a subsequent merge of the VMA, in
+ * which case the returned VMA is one that was merged on a second attempt.
+ *
+ * @map: Mapping state.
+ * @vmg: VMA merge state.
+ * @vmap: Output pointer for the new VMA.
+ *
+ * Returns: Zero on success, or an error.
+ */
+static int __mmap_new_vma(struct mmap_state *map, struct vma_merge_struct *vmg,
+ struct vm_area_struct **vmap)
+{
+ struct vma_iterator *vmi = map->vmi;
+ int error = 0;
+ bool merged = false;
+ struct vm_area_struct *vma;
+
/*
* Determine the object being mapped and call the appropriate
* specific mapper. the address has already been validated, but
* not unmapped, but the maps are removed from the list.
*/
- vma = vm_area_alloc(mm);
- if (!vma) {
- error = -ENOMEM;
- goto unacct_error;
- }
+ vma = vm_area_alloc(map->mm);
+ if (!vma)
+ return -ENOMEM;
- vma_iter_config(&vmi, addr, end);
- vma_set_range(vma, addr, end, pgoff);
- vm_flags_init(vma, vm_flags);
- vma->vm_page_prot = vm_get_page_prot(vm_flags);
+ vma_iter_config(vmi, map->addr, map->end);
+ vma_set_range(vma, map->addr, map->end, map->pgoff);
+ vm_flags_init(vma, map->flags);
+ vma->vm_page_prot = vm_get_page_prot(map->flags);
- if (vma_iter_prealloc(&vmi, vma)) {
+ if (vma_iter_prealloc(vmi, vma)) {
error = -ENOMEM;
goto free_vma;
}
- if (file) {
- vma->vm_file = get_file(file);
- error = mmap_file(file, vma);
- if (error)
- goto unmap_and_free_file_vma;
-
- /* Drivers cannot alter the address of the VMA. */
- WARN_ON_ONCE(addr != vma->vm_start);
- /*
- * Drivers should not permit writability when previously it was
- * disallowed.
- */
- VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
- !(vm_flags & VM_MAYWRITE) &&
- (vma->vm_flags & VM_MAYWRITE));
-
- vma_iter_config(&vmi, addr, end);
- /*
- * If vm_flags changed after mmap_file(), we should try merge
- * vma again as we may succeed this time.
- */
- if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
- struct vm_area_struct *merge;
-
- vmg.flags = vma->vm_flags;
- /* If this fails, state is reset ready for a reattempt. */
- merge = vma_merge_new_range(&vmg);
-
- if (merge) {
- /*
- * ->mmap() can change vma->vm_file and fput
- * the original file. So fput the vma->vm_file
- * here or we would add an extra fput for file
- * and cause general protection fault
- * ultimately.
- */
- fput(vma->vm_file);
- vm_area_free(vma);
- vma = merge;
- /* Update vm_flags to pick up the change. */
- vm_flags = vma->vm_flags;
- goto file_expanded;
- }
- vma_iter_config(&vmi, addr, end);
- }
-
- vm_flags = vma->vm_flags;
- } else if (vm_flags & VM_SHARED) {
+ if (map->file)
+ error = __mmap_new_file_vma(map, vmg, &vma, &merged);
+ else if (map->flags & VM_SHARED)
error = shmem_zero_setup(vma);
- if (error)
- goto free_iter_vma;
- } else {
+ else
vma_set_anonymous(vma);
- }
+
+ if (error)
+ goto free_iter_vma;
+
+ if (merged)
+ goto file_expanded;
#ifdef CONFIG_SPARC64
/* TODO: Fix SPARC ADI! */
- WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+ WARN_ON_ONCE(!arch_validate_flags(map->flags));
#endif
/* Lock the VMA since it is modified after insertion into VMA tree */
vma_start_write(vma);
- vma_iter_store(&vmi, vma);
- mm->map_count++;
+ vma_iter_store(vmi, vma);
+ map->mm->map_count++;
vma_link_file(vma);
/*
* vma_merge_new_range() calls khugepaged_enter_vma() too, the below
* call covers the non-merge case.
*/
- khugepaged_enter_vma(vma, vma->vm_flags);
+ khugepaged_enter_vma(vma, map->flags);
file_expanded:
- file = vma->vm_file;
ksm_add_vma(vma);
-expanded:
+ *vmap = vma;
+ return 0;
+
+free_iter_vma:
+ vma_iter_free(vmi);
+free_vma:
+ vm_area_free(vma);
+ return error;
+}
+
+/*
+ * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
+ * statistics, handle locking and finalise the VMA.
+ *
+ * @map: Mapping state.
+ * @vma: Merged or newly allocated VMA for the mmap()'d region.
+ */
+static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
+{
+ struct mm_struct *mm = map->mm;
+ unsigned long vm_flags = vma->vm_flags;
+
perf_event_mmap(vma);
- /* Unmap any existing mapping in the area */
- vms_complete_munmap_vmas(&vms, &mas_detach);
+ /* Unmap any existing mapping in the area. */
+ vms_complete_munmap_vmas(&map->vms, &map->mas_detach);
- vm_stat_account(mm, vm_flags, pglen);
+ vm_stat_account(mm, vma->vm_flags, map->pglen);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current->mm))
+ vma == get_gate_vma(mm))
vm_flags_clear(vma, VM_LOCKED_MASK);
else
- mm->locked_vm += pglen;
+ mm->locked_vm += map->pglen;
}
- if (file)
+ if (vma->vm_file)
uprobe_mmap(vma);
/*
@@ -2363,26 +2468,42 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
vm_flags_set(vma, VM_SOFTDIRTY);
vma_set_page_prot(vma);
+}
- return addr;
+unsigned long __mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+ struct list_head *uf)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ int error;
+ VMA_ITERATOR(vmi, mm, addr);
+ VMG_STATE(vmg, mm, &vmi, addr, addr + len, vm_flags, pgoff);
+ MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
-unmap_and_free_file_vma:
- fput(vma->vm_file);
- vma->vm_file = NULL;
+ error = __mmap_prepare(&map, &vmg, uf);
+ if (error)
+ goto abort_munmap;
- vma_iter_set(&vmi, vma->vm_end);
- /* Undo any partial mapping done by a device driver. */
- unmap_region(&vmi.mas, vma, vmg.prev, vmg.next);
-free_iter_vma:
- vma_iter_free(&vmi);
-free_vma:
- vm_area_free(vma);
-unacct_error:
- if (charged)
- vm_unacct_memory(charged);
+ /* Attempt to merge with adjacent VMAs... */
+ vmg.flags = map.flags;
+ vma = vma_merge_new_range(&vmg);
+ if (!vma) {
+ /* ...but if we can't, allocate a new VMA. */
+ error = __mmap_new_vma(&map, &vmg, &vma);
+ if (error)
+ goto unacct_error;
+ }
+
+ __mmap_complete(&map, vma);
+ return addr;
+
+ /* Accounting was done by __mmap_prepare(). */
+unacct_error:
+ if (map.charged)
+ vm_unacct_memory(map.charged);
abort_munmap:
- vms_abort_munmap_vmas(&vms, &mas_detach);
-gather_failed:
+ vms_abort_munmap_vmas(&map.vms, &map.mas_detach);
return error;
}
--
2.47.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v2 8/8] mm: defer second attempt at merge on mmap()
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
` (6 preceding siblings ...)
2024-10-23 20:38 ` [PATCH v2 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
@ 2024-10-23 20:38 ` Lorenzo Stoakes
2024-10-25 9:43 ` Vlastimil Babka
7 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-23 20:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Liam R . Howlett, Vlastimil Babka, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
Rather than trying to merge again when ostensibly allocating a new VMA,
instead defer until the VMA is added and attempt to merge the existing
range.
This way we have no complicated unwinding logic midway through the process
of mapping the VMA.
In addition, this removes the previous implicit limitation that the VMA
could not be the first in the virtual memory address space.
It also performs this merge after the final flag adjustments are performed,
something that was not done previously and thus might have prevented
possibly valid merges in the past.
In theory, for this very same reason, we should unconditionally attempt
merge here, however this is likely to have a performance impact so it is
better to avoid this given the unlikely outcome of a merge.
The vmg state will already have been reset by the first attempt at a merge
so we only need to reset the iterator, set the vma and flags and try again.
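For reference, the shape of the change, simplified from the diff below:

	/* In __mmap_new_file_vma(): note whether a later retry is worthwhile. */
	map->retry_merge = vma->vm_flags != map->flags &&
			   !(vma->vm_flags & VM_SPECIAL);

	/* In __mmap_complete(), once the VMA is fully set up: */
	if (map->retry_merge) {
		vma_iter_config(map->vmi, map->addr, map->end);
		vmg->vma = vma;
		vmg->flags = map->flags;
		vmg->next = NULL;	/* Must be set by merge logic. */

		vma_merge_existing_range(vmg);
	}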
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/vma.c | 75 ++++++++++++++++++++------------------------------------
1 file changed, 26 insertions(+), 49 deletions(-)
diff --git a/mm/vma.c b/mm/vma.c
index 065f5e1f65be..c493ecebf394 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -19,6 +19,7 @@ struct mmap_state {
struct file *file;
unsigned long charged;
+ bool retry_merge;
struct vm_area_struct *prev;
struct vm_area_struct *next;
@@ -2280,9 +2281,9 @@ static int __mmap_prepare(struct mmap_state *map, struct vma_merge_struct *vmg,
return 0;
}
+
static int __mmap_new_file_vma(struct mmap_state *map,
- struct vma_merge_struct *vmg,
- struct vm_area_struct **vmap, bool *mergedp)
+ struct vm_area_struct **vmap)
{
struct vma_iterator *vmi = map->vmi;
struct vm_area_struct *vma = *vmap;
@@ -2311,37 +2312,11 @@ static int __mmap_new_file_vma(struct mmap_state *map,
!(map->flags & VM_MAYWRITE) &&
(vma->vm_flags & VM_MAYWRITE));
- vma_iter_config(vmi, map->addr, map->end);
- /*
- * If flags changed after mmap_file(), we should try merge
- * vma again as we may succeed this time.
- */
- if (unlikely(map->flags != vma->vm_flags && map->prev)) {
- struct vm_area_struct *merge;
-
- vmg->flags = vma->vm_flags;
- /* If this fails, state is reset ready for a reattempt. */
- merge = vma_merge_new_range(vmg);
-
- if (merge) {
- /*
- * ->mmap() can change vma->vm_file and fput
- * the original file. So fput the vma->vm_file
- * here or we would add an extra fput for file
- * and cause general protection fault
- * ultimately.
- */
- fput(vma->vm_file);
- vm_area_free(vma);
- vma = merge;
- *mergedp = true;
- } else {
- vma_iter_config(vmi, map->addr, map->end);
- }
- }
+ /* If the flags change (and are mergeable), let's retry later. */
+ map->retry_merge = vma->vm_flags != map->flags && !(vma->vm_flags & VM_SPECIAL);
+ vma_iter_config(vmi, map->addr, map->end);
map->flags = vma->vm_flags;
- *vmap = vma;
return 0;
}
@@ -2349,22 +2324,15 @@ static int __mmap_new_file_vma(struct mmap_state *map,
* __mmap_new_vma() - Allocate a new VMA for the region, as merging was not
* possible.
*
- * An exception to this is if the mapping is file-backed, and the underlying
- * driver changes the VMA flags, permitting a subsequent merge of the VMA, in
- * which case the returned VMA is one that was merged on a second attempt.
- *
* @map: Mapping state.
- * @vmg: VMA merge state.
* @vmap: Output pointer for the new VMA.
*
* Returns: Zero on success, or an error.
*/
-static int __mmap_new_vma(struct mmap_state *map, struct vma_merge_struct *vmg,
- struct vm_area_struct **vmap)
+static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
{
struct vma_iterator *vmi = map->vmi;
int error = 0;
- bool merged = false;
struct vm_area_struct *vma;
/*
@@ -2387,7 +2355,7 @@ static int __mmap_new_vma(struct mmap_state *map, struct vma_merge_struct *vmg,
}
if (map->file)
- error = __mmap_new_file_vma(map, vmg, &vma, &merged);
+ error = __mmap_new_file_vma(map, &vma);
else if (map->flags & VM_SHARED)
error = shmem_zero_setup(vma);
else
@@ -2396,9 +2364,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vma_merge_struct *vmg,
if (error)
goto free_iter_vma;
- if (merged)
- goto file_expanded;
-
#ifdef CONFIG_SPARC64
/* TODO: Fix SPARC ADI! */
WARN_ON_ONCE(!arch_validate_flags(map->flags));
@@ -2415,8 +2380,6 @@ static int __mmap_new_vma(struct mmap_state *map, struct vma_merge_struct *vmg,
* call covers the non-merge case.
*/
khugepaged_enter_vma(vma, map->flags);
-
-file_expanded:
ksm_add_vma(vma);
*vmap = vma;
return 0;
@@ -2430,13 +2393,17 @@ static int __mmap_new_vma(struct mmap_state *map, struct vma_merge_struct *vmg,
/*
* __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
- * statistics, handle locking and finalise the VMA.
+ * statistics, handle locking and finalise the VMA,
+ * attempt a final merge if required.
*
* @map: Mapping state.
* @vma: Merged or newly allocated VMA for the mmap()'d region.
+ * @vmg: VMA merge state.
*/
-static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
+static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma,
+ struct vma_merge_struct *vmg)
{
+
struct mm_struct *mm = map->mm;
unsigned long vm_flags = vma->vm_flags;
@@ -2468,6 +2435,16 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
vm_flags_set(vma, VM_SOFTDIRTY);
vma_set_page_prot(vma);
+
+ /* OK VMA flags changed in __mmap_new_vma(), try a merge again. */
+ if (map->retry_merge) {
+ vma_iter_config(map->vmi, map->addr, map->end);
+ vmg->vma = vma;
+ vmg->flags = map->flags;
+ vmg->next = NULL; /* Must be set by merge logic. */
+
+ vma_merge_existing_range(vmg);
+ }
}
unsigned long __mmap_region(struct file *file, unsigned long addr,
@@ -2490,12 +2467,12 @@ unsigned long __mmap_region(struct file *file, unsigned long addr,
vma = vma_merge_new_range(&vmg);
if (!vma) {
/* ...but if we can't, allocate a new VMA. */
- error = __mmap_new_vma(&map, &vmg, &vma);
+ error = __mmap_new_vma(&map, &vma);
if (error)
goto unacct_error;
}
- __mmap_complete(&map, vma);
+ __mmap_complete(&map, vma, &vmg);
return addr;
--
2.47.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 6/8] mm: isolate mmap internal logic to mm/vma.c
2024-10-23 20:38 ` [PATCH v2 6/8] mm: isolate mmap internal logic to mm/vma.c Lorenzo Stoakes
@ 2024-10-24 17:23 ` Vlastimil Babka
0 siblings, 0 replies; 43+ messages in thread
From: Vlastimil Babka @ 2024-10-24 17:23 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
Linus Torvalds, Peter Xu
On 10/23/24 22:38, Lorenzo Stoakes wrote:
> In previous commits we effected improvements to the mmap() logic in
> mmap_region() and its newly introduced internal implementation function
> __mmap_region().
>
> However as these changes are intended to be backported, we kept the delta
> as small as is possible and made as few changes as possible to the newly
> introduced mm/vma.* files.
>
> Take the opportunity to move this logic to mm/vma.c which not only isolates
> it, but also makes it available for later userland testing which can help
> us catch such logic errors far earlier.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
With help of git's colorMoved:
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 7/8] mm: refactor __mmap_region()
2024-10-23 20:38 ` [PATCH v2 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
@ 2024-10-25 8:35 ` Vlastimil Babka
2024-10-25 10:19 ` Lorenzo Stoakes
0 siblings, 1 reply; 43+ messages in thread
From: Vlastimil Babka @ 2024-10-25 8:35 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
Linus Torvalds, Peter Xu
On 10/23/24 22:38, Lorenzo Stoakes wrote:
> We have seen bugs and resource leaks arise from the complexity of the
> __mmap_region() function. This, together with the deeply fragile error
> handling logic and the complexity that makes the function difficult to
> understand, makes it highly desirable to refactor it into something
> readable.
>
> Achieve this by separating the function into smaller logical parts which
> are easier to understand and follow, and which importantly very
> significantly simplify the error handling.
>
> Note that we now call vms_abort_munmap_vmas() in more error paths than we
> used to, however in cases where no abort need occur, vms->nr_pages will be
> equal to zero and we simply exit this function without doing more than we
> would have done previously.
>
> Importantly, the invocation of the driver mmap hook via mmap_file() now has
> very simple and obvious handling (this was previously the most problematic
> part of the mmap() operation).
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Some nits:
> +static int __mmap_prepare(struct mmap_state *map, struct vma_merge_struct *vmg,
> + struct list_head *uf)
> {
> - struct mm_struct *mm = current->mm;
> - struct vm_area_struct *vma = NULL;
> - pgoff_t pglen = PHYS_PFN(len);
> - unsigned long charged = 0;
> - struct vma_munmap_struct vms;
> - struct ma_state mas_detach;
> - struct maple_tree mt_detach;
> - unsigned long end = addr + len;
> int error;
> - VMA_ITERATOR(vmi, mm, addr);
> - VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> -
> - vmg.file = file;
> - /* Find the first overlapping VMA */
> - vma = vma_find(&vmi, end);
> - init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> - if (vma) {
> - mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> - mt_on_stack(mt_detach);
> - mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> + struct vma_iterator *vmi = map->vmi;
> + struct vma_munmap_struct *vms = &map->vms;
> +
> + /* Find the first overlapping VMA and initialise unmap state. */
> + vms->vma = vma_find(vmi, map->end);
> + init_vma_munmap(vms, vmi, vms->vma, map->addr, map->end, uf,
> + /* unlock = */ false);
> +
> + /* OK, we have overlapping VMAs - prepare to unmap them. */
> + if (vms->vma) {
> + mt_init_flags(&map->mt_detach,
> + vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> + mt_on_stack(map->mt_detach);
> + mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
> /* Prepare to unmap any existing mapping in the area */
> - error = vms_gather_munmap_vmas(&vms, &mas_detach);
> - if (error)
> - goto gather_failed;
> + error = vms_gather_munmap_vmas(vms, &map->mas_detach);
> + if (error) {
> + /* On error VMAs will already have been reattached. */
> + vms->nr_pages = 0;
> + return error;
> + }
>
> - vmg.next = vms.next;
> - vmg.prev = vms.prev;
> - vma = NULL;
> + map->next = vms->next;
> + map->prev = vms->prev;
> } else {
> - vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
> + map->next = vma_iter_next_rewind(vmi, &map->prev);
> }
>
> + /* Set up vmg for merge attempt. */
> + vmg->file = map->file;
> + vmg->prev = map->prev;
> + vmg->next = map->next;
It seems arbitrary to do this here. vmg->flags are set in __mmap_region().
Why not all of this? We wouldn't need to pass vmg to __mmap_prepare() at
all? Do any of the values change between here and returning and vmg needs to
capture them as they are now? AFAICS not.
> +
> /* Check against address space limit. */
> - if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
> - error = -ENOMEM;
> - goto abort_munmap;
> - }
> + if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
> + return -ENOMEM;
>
> - /*
> - * Private writable mapping: check memory availability
> - */
> - if (accountable_mapping(file, vm_flags)) {
> - charged = pglen;
> - charged -= vms.nr_accounted;
> - if (charged) {
> - error = security_vm_enough_memory_mm(mm, charged);
> + /* Private writable mapping: check memory availability. */
> + if (accountable_mapping(map->file, map->flags)) {
> + map->charged = map->pglen;
> + map->charged -= vms->nr_accounted;
> + if (map->charged) {
> + error = security_vm_enough_memory_mm(map->mm, map->charged);
> if (error)
> - goto abort_munmap;
> + return error;
> }
>
> - vms.nr_accounted = 0;
> - vm_flags |= VM_ACCOUNT;
> - vmg.flags = vm_flags;
> + vms->nr_accounted = 0;
> + map->flags |= VM_ACCOUNT;
> }
>
> /*
> - * clear PTEs while the vma is still in the tree so that rmap
> + * Clear PTEs while the vma is still in the tree so that rmap
> * cannot race with the freeing later in the truncate scenario.
> * This is also needed for mmap_file(), which is why vm_ops
> * close function is called.
> */
> - vms_clean_up_area(&vms, &mas_detach);
> - vma = vma_merge_new_range(&vmg);
> - if (vma)
> - goto expanded;
> + vms_clean_up_area(vms, &map->mas_detach);
> +
> + return 0;
> +}
> +
<snip>
> +static int __mmap_new_vma(struct mmap_state *map, struct vma_merge_struct *vmg,
> + struct vm_area_struct **vmap)
> +{
> + struct vma_iterator *vmi = map->vmi;
> + int error = 0;
> + bool merged = false;
> + struct vm_area_struct *vma;
> +
> /*
> * Determine the object being mapped and call the appropriate
> * specific mapper. the address has already been validated, but
> * not unmapped, but the maps are removed from the list.
> */
> - vma = vm_area_alloc(mm);
> - if (!vma) {
> - error = -ENOMEM;
> - goto unacct_error;
> - }
> + vma = vm_area_alloc(map->mm);
> + if (!vma)
> + return -ENOMEM;
>
> - vma_iter_config(&vmi, addr, end);
> - vma_set_range(vma, addr, end, pgoff);
> - vm_flags_init(vma, vm_flags);
> - vma->vm_page_prot = vm_get_page_prot(vm_flags);
> + vma_iter_config(vmi, map->addr, map->end);
> + vma_set_range(vma, map->addr, map->end, map->pgoff);
> + vm_flags_init(vma, map->flags);
> + vma->vm_page_prot = vm_get_page_prot(map->flags);
>
> - if (vma_iter_prealloc(&vmi, vma)) {
> + if (vma_iter_prealloc(vmi, vma)) {
> error = -ENOMEM;
> goto free_vma;
> }
>
> - if (file) {
> - vma->vm_file = get_file(file);
> - error = mmap_file(file, vma);
> - if (error)
> - goto unmap_and_free_file_vma;
> -
> - /* Drivers cannot alter the address of the VMA. */
> - WARN_ON_ONCE(addr != vma->vm_start);
> - /*
> - * Drivers should not permit writability when previously it was
> - * disallowed.
> - */
> - VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> - !(vm_flags & VM_MAYWRITE) &&
> - (vma->vm_flags & VM_MAYWRITE));
> -
> - vma_iter_config(&vmi, addr, end);
> - /*
> - * If vm_flags changed after mmap_file(), we should try merge
> - * vma again as we may succeed this time.
> - */
> - if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> - struct vm_area_struct *merge;
> -
> - vmg.flags = vma->vm_flags;
> - /* If this fails, state is reset ready for a reattempt. */
> - merge = vma_merge_new_range(&vmg);
> -
> - if (merge) {
> - /*
> - * ->mmap() can change vma->vm_file and fput
> - * the original file. So fput the vma->vm_file
> - * here or we would add an extra fput for file
> - * and cause general protection fault
> - * ultimately.
> - */
> - fput(vma->vm_file);
> - vm_area_free(vma);
> - vma = merge;
> - /* Update vm_flags to pick up the change. */
> - vm_flags = vma->vm_flags;
> - goto file_expanded;
> - }
> - vma_iter_config(&vmi, addr, end);
> - }
> -
> - vm_flags = vma->vm_flags;
> - } else if (vm_flags & VM_SHARED) {
> + if (map->file)
> + error = __mmap_new_file_vma(map, vmg, &vma, &merged);
> + else if (map->flags & VM_SHARED)
> error = shmem_zero_setup(vma);
> - if (error)
> - goto free_iter_vma;
> - } else {
> + else
> vma_set_anonymous(vma);
> - }
> +
> + if (error)
> + goto free_iter_vma;
> +
> + if (merged)
> + goto file_expanded;
>
> #ifdef CONFIG_SPARC64
> /* TODO: Fix SPARC ADI! */
> - WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> + WARN_ON_ONCE(!arch_validate_flags(map->flags));
> #endif
>
> /* Lock the VMA since it is modified after insertion into VMA tree */
> vma_start_write(vma);
> - vma_iter_store(&vmi, vma);
> - mm->map_count++;
> + vma_iter_store(vmi, vma);
> + map->mm->map_count++;
> vma_link_file(vma);
>
> /*
> * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
> * call covers the non-merge case.
> */
> - khugepaged_enter_vma(vma, vma->vm_flags);
> + khugepaged_enter_vma(vma, map->flags);
>
> file_expanded:
> - file = vma->vm_file;
> ksm_add_vma(vma);
I'm wondering if this could go to the "non file expanded" region above. If
we expanded a vma, it was either in ksm and stays in ksm, or it was not
eligible and that couldn't have changed by expanding?
> -expanded:
> + *vmap = vma;
> + return 0;
> +
> +free_iter_vma:
> + vma_iter_free(vmi);
> +free_vma:
> + vm_area_free(vma);
> + return error;
> +}
> +
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 8/8] mm: defer second attempt at merge on mmap()
2024-10-23 20:38 ` [PATCH v2 8/8] mm: defer second attempt at merge on mmap() Lorenzo Stoakes
@ 2024-10-25 9:43 ` Vlastimil Babka
2024-10-25 10:20 ` Lorenzo Stoakes
0 siblings, 1 reply; 43+ messages in thread
From: Vlastimil Babka @ 2024-10-25 9:43 UTC (permalink / raw)
To: Lorenzo Stoakes, Andrew Morton
Cc: Liam R . Howlett, Jann Horn, linux-kernel, linux-mm,
Linus Torvalds, Peter Xu
On 10/23/24 22:38, Lorenzo Stoakes wrote:
> Rather than trying to merge again when ostensibly allocating a new VMA,
> instead defer until the VMA is added and attempt to merge the existing
> range.
>
> This way we have no complicated unwinding logic midway through the process
> of mapping the VMA.
>
> In addition, this removes the previous implicit limitation that the VMA
> could not be the first in the virtual memory address space.
>
> It also performs this merge after the final flag adjustments are performed,
> something that was not done previously and thus might have prevented
> possibly valid merges in the past.
>
> In theory, for this very same reason, we should unconditionally attempt
> merge here, however this is likely to have a performance impact so it is
> better to avoid this given the unlikely outcome of a merge.
Maybe just expand the cases where we set map->retry_merge, i.e. in case the
final flag adjustments do anything?
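E.g. something like this just before the retry check in __mmap_complete(),
purely as an illustration (untested):

	/* If finalisation (e.g. VM_SOFTDIRTY) changed the flags, retry too. */
	if (vma->vm_flags != map->flags)
		map->retry_merge = true;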
> The vmg state will already have been reset by the first attempt at a merge
> so we only need to reset the iterator, set the vma and flags and try again.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
A nit:
> ---
> mm/vma.c | 75 ++++++++++++++++++++------------------------------------
> 1 file changed, 26 insertions(+), 49 deletions(-)
>
> diff --git a/mm/vma.c b/mm/vma.c
> index 065f5e1f65be..c493ecebf394 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -19,6 +19,7 @@ struct mmap_state {
> struct file *file;
>
> unsigned long charged;
> + bool retry_merge;
>
> struct vm_area_struct *prev;
> struct vm_area_struct *next;
> @@ -2280,9 +2281,9 @@ static int __mmap_prepare(struct mmap_state *map, struct vma_merge_struct *vmg,
> return 0;
> }
>
> +
> static int __mmap_new_file_vma(struct mmap_state *map,
> - struct vma_merge_struct *vmg,
> - struct vm_area_struct **vmap, bool *mergedp)
> + struct vm_area_struct **vmap)
> {
> struct vma_iterator *vmi = map->vmi;
> struct vm_area_struct *vma = *vmap;
> @@ -2311,37 +2312,11 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> !(map->flags & VM_MAYWRITE) &&
> (vma->vm_flags & VM_MAYWRITE));
>
> - vma_iter_config(vmi, map->addr, map->end);
> - /*
> - * If flags changed after mmap_file(), we should try merge
> - * vma again as we may succeed this time.
> - */
> - if (unlikely(map->flags != vma->vm_flags && map->prev)) {
> - struct vm_area_struct *merge;
> -
> - vmg->flags = vma->vm_flags;
> - /* If this fails, state is reset ready for a reattempt. */
> - merge = vma_merge_new_range(vmg);
> -
> - if (merge) {
> - /*
> - * ->mmap() can change vma->vm_file and fput
> - * the original file. So fput the vma->vm_file
> - * here or we would add an extra fput for file
> - * and cause general protection fault
> - * ultimately.
> - */
> - fput(vma->vm_file);
> - vm_area_free(vma);
> - vma = merge;
> - *mergedp = true;
> - } else {
> - vma_iter_config(vmi, map->addr, map->end);
> - }
> - }
> + /* If the flags change (and are mergeable), let's retry later. */
> + map->retry_merge = vma->vm_flags != map->flags && !(vma->vm_flags & VM_SPECIAL);
>
> + vma_iter_config(vmi, map->addr, map->end);
Do we need this still? __mmap_new_vma() did that and nothing changed since
in the non-error case, AFAICS?
> map->flags = vma->vm_flags;
> - *vmap = vma;
> return 0;
> }
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 7/8] mm: refactor __mmap_region()
2024-10-25 8:35 ` Vlastimil Babka
@ 2024-10-25 10:19 ` Lorenzo Stoakes
2024-10-25 10:23 ` Vlastimil Babka
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-25 10:19 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, Liam R . Howlett, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
On Fri, Oct 25, 2024 at 10:35:20AM +0200, Vlastimil Babka wrote:
> On 10/23/24 22:38, Lorenzo Stoakes wrote:
> > We have seen bugs and resource leaks arise from the complexity of the
> > __mmap_region() function. This, together with the deeply fragile error
> > handling logic and the complexity that makes the function difficult to
> > understand, makes it highly desirable to refactor it into something
> > readable.
> >
> > Achieve this by separating the function into smaller logical parts which
> > are easier to understand and follow, and which importantly very
> > significantly simplify the error handling.
> >
> > Note that we now call vms_abort_munmap_vmas() in more error paths than we
> > used to, however in cases where no abort need occur, vms->nr_pages will be
> > equal to zero and we simply exit this function without doing more than we
> > would have done previously.
> >
> > Importantly, the invocation of the driver mmap hook via mmap_file() now has
> > very simple and obvious handling (this was previously the most problematic
> > part of the mmap() operation).
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
> Some nits:
>
> > +static int __mmap_prepare(struct mmap_state *map, struct vma_merge_struct *vmg,
> > + struct list_head *uf)
> > {
> > - struct mm_struct *mm = current->mm;
> > - struct vm_area_struct *vma = NULL;
> > - pgoff_t pglen = PHYS_PFN(len);
> > - unsigned long charged = 0;
> > - struct vma_munmap_struct vms;
> > - struct ma_state mas_detach;
> > - struct maple_tree mt_detach;
> > - unsigned long end = addr + len;
> > int error;
> > - VMA_ITERATOR(vmi, mm, addr);
> > - VMG_STATE(vmg, mm, &vmi, addr, end, vm_flags, pgoff);
> > -
> > - vmg.file = file;
> > - /* Find the first overlapping VMA */
> > - vma = vma_find(&vmi, end);
> > - init_vma_munmap(&vms, &vmi, vma, addr, end, uf, /* unlock = */ false);
> > - if (vma) {
> > - mt_init_flags(&mt_detach, vmi.mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > - mt_on_stack(mt_detach);
> > - mas_init(&mas_detach, &mt_detach, /* addr = */ 0);
> > + struct vma_iterator *vmi = map->vmi;
> > + struct vma_munmap_struct *vms = &map->vms;
> > +
> > + /* Find the first overlapping VMA and initialise unmap state. */
> > + vms->vma = vma_find(vmi, map->end);
> > + init_vma_munmap(vms, vmi, vms->vma, map->addr, map->end, uf,
> > + /* unlock = */ false);
> > +
> > + /* OK, we have overlapping VMAs - prepare to unmap them. */
> > + if (vms->vma) {
> > + mt_init_flags(&map->mt_detach,
> > + vmi->mas.tree->ma_flags & MT_FLAGS_LOCK_MASK);
> > + mt_on_stack(map->mt_detach);
> > + mas_init(&map->mas_detach, &map->mt_detach, /* addr = */ 0);
> > /* Prepare to unmap any existing mapping in the area */
> > - error = vms_gather_munmap_vmas(&vms, &mas_detach);
> > - if (error)
> > - goto gather_failed;
> > + error = vms_gather_munmap_vmas(vms, &map->mas_detach);
> > + if (error) {
> > + /* On error VMAs will already have been reattached. */
> > + vms->nr_pages = 0;
> > + return error;
> > + }
> >
> > - vmg.next = vms.next;
> > - vmg.prev = vms.prev;
> > - vma = NULL;
> > + map->next = vms->next;
> > + map->prev = vms->prev;
> > } else {
> > - vmg.next = vma_iter_next_rewind(&vmi, &vmg.prev);
> > + map->next = vma_iter_next_rewind(vmi, &map->prev);
> > }
> >
> > + /* Set up vmg for merge attempt. */
> > + vmg->file = map->file;
> > + vmg->prev = map->prev;
> > + vmg->next = map->next;
>
> It seems arbitrary to do this here. vmg->flags are set in __mmap_region().
> Why not all of this? We wouldn't need to pass vmg to __mmap_prepare() at
> all? Do any of the values change between here and returning and vmg needs to
> capture them as they are now? AFAICS not.
Yeah I sort of felt we are 'preparing' things.
This _was_ necessary before, when we were ickily leaning on the vmg to store
state and not mutate it, which is not ideal.
So yeah I think probably it's better to just put _all_ the merge stuff in
__mmap_region() rather than just some.
This aligns with the need to strip out the 'reset' logic from
vma_merge_new_range().
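I.e. something like this in __mmap_region() (untested, just sketching the
shape, with vmg no longer passed to __mmap_prepare()):

	error = __mmap_prepare(&map, uf);
	if (error)
		goto abort_munmap;

	/* Set up vmg for the merge attempt here instead. */
	vmg.file = map.file;
	vmg.prev = map.prev;
	vmg.next = map.next;
	vmg.flags = map.flags;

	vma = vma_merge_new_range(&vmg);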
>
> > +
> > /* Check against address space limit. */
> > - if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages)) {
> > - error = -ENOMEM;
> > - goto abort_munmap;
> > - }
> > + if (!may_expand_vm(map->mm, map->flags, map->pglen - vms->nr_pages))
> > + return -ENOMEM;
> >
> > - /*
> > - * Private writable mapping: check memory availability
> > - */
> > - if (accountable_mapping(file, vm_flags)) {
> > - charged = pglen;
> > - charged -= vms.nr_accounted;
> > - if (charged) {
> > - error = security_vm_enough_memory_mm(mm, charged);
> > + /* Private writable mapping: check memory availability. */
> > + if (accountable_mapping(map->file, map->flags)) {
> > + map->charged = map->pglen;
> > + map->charged -= vms->nr_accounted;
> > + if (map->charged) {
> > + error = security_vm_enough_memory_mm(map->mm, map->charged);
> > if (error)
> > - goto abort_munmap;
> > + return error;
> > }
> >
> > - vms.nr_accounted = 0;
> > - vm_flags |= VM_ACCOUNT;
> > - vmg.flags = vm_flags;
> > + vms->nr_accounted = 0;
> > + map->flags |= VM_ACCOUNT;
> > }
> >
> > /*
> > - * clear PTEs while the vma is still in the tree so that rmap
> > + * Clear PTEs while the vma is still in the tree so that rmap
> > * cannot race with the freeing later in the truncate scenario.
> > * This is also needed for mmap_file(), which is why vm_ops
> > * close function is called.
> > */
> > - vms_clean_up_area(&vms, &mas_detach);
> > - vma = vma_merge_new_range(&vmg);
> > - if (vma)
> > - goto expanded;
> > + vms_clean_up_area(vms, &map->mas_detach);
> > +
> > + return 0;
> > +}
> > +
>
> <snip>
>
> > +static int __mmap_new_vma(struct mmap_state *map, struct vma_merge_struct *vmg,
> > + struct vm_area_struct **vmap)
> > +{
> > + struct vma_iterator *vmi = map->vmi;
> > + int error = 0;
> > + bool merged = false;
> > + struct vm_area_struct *vma;
> > +
> > /*
> > * Determine the object being mapped and call the appropriate
> > * specific mapper. the address has already been validated, but
> > * not unmapped, but the maps are removed from the list.
> > */
> > - vma = vm_area_alloc(mm);
> > - if (!vma) {
> > - error = -ENOMEM;
> > - goto unacct_error;
> > - }
> > + vma = vm_area_alloc(map->mm);
> > + if (!vma)
> > + return -ENOMEM;
> >
> > - vma_iter_config(&vmi, addr, end);
> > - vma_set_range(vma, addr, end, pgoff);
> > - vm_flags_init(vma, vm_flags);
> > - vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > + vma_iter_config(vmi, map->addr, map->end);
> > + vma_set_range(vma, map->addr, map->end, map->pgoff);
> > + vm_flags_init(vma, map->flags);
> > + vma->vm_page_prot = vm_get_page_prot(map->flags);
> >
> > - if (vma_iter_prealloc(&vmi, vma)) {
> > + if (vma_iter_prealloc(vmi, vma)) {
> > error = -ENOMEM;
> > goto free_vma;
> > }
> >
> > - if (file) {
> > - vma->vm_file = get_file(file);
> > - error = mmap_file(file, vma);
> > - if (error)
> > - goto unmap_and_free_file_vma;
> > -
> > - /* Drivers cannot alter the address of the VMA. */
> > - WARN_ON_ONCE(addr != vma->vm_start);
> > - /*
> > - * Drivers should not permit writability when previously it was
> > - * disallowed.
> > - */
> > - VM_WARN_ON_ONCE(vm_flags != vma->vm_flags &&
> > - !(vm_flags & VM_MAYWRITE) &&
> > - (vma->vm_flags & VM_MAYWRITE));
> > -
> > - vma_iter_config(&vmi, addr, end);
> > - /*
> > - * If vm_flags changed after mmap_file(), we should try merge
> > - * vma again as we may succeed this time.
> > - */
> > - if (unlikely(vm_flags != vma->vm_flags && vmg.prev)) {
> > - struct vm_area_struct *merge;
> > -
> > - vmg.flags = vma->vm_flags;
> > - /* If this fails, state is reset ready for a reattempt. */
> > - merge = vma_merge_new_range(&vmg);
> > -
> > - if (merge) {
> > - /*
> > - * ->mmap() can change vma->vm_file and fput
> > - * the original file. So fput the vma->vm_file
> > - * here or we would add an extra fput for file
> > - * and cause general protection fault
> > - * ultimately.
> > - */
> > - fput(vma->vm_file);
> > - vm_area_free(vma);
> > - vma = merge;
> > - /* Update vm_flags to pick up the change. */
> > - vm_flags = vma->vm_flags;
> > - goto file_expanded;
> > - }
> > - vma_iter_config(&vmi, addr, end);
> > - }
> > -
> > - vm_flags = vma->vm_flags;
> > - } else if (vm_flags & VM_SHARED) {
> > + if (map->file)
> > + error = __mmap_new_file_vma(map, vmg, &vma, &merged);
> > + else if (map->flags & VM_SHARED)
> > error = shmem_zero_setup(vma);
> > - if (error)
> > - goto free_iter_vma;
> > - } else {
> > + else
> > vma_set_anonymous(vma);
> > - }
> > +
> > + if (error)
> > + goto free_iter_vma;
> > +
> > + if (merged)
> > + goto file_expanded;
> >
> > #ifdef CONFIG_SPARC64
> > /* TODO: Fix SPARC ADI! */
> > - WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> > + WARN_ON_ONCE(!arch_validate_flags(map->flags));
> > #endif
> >
> > /* Lock the VMA since it is modified after insertion into VMA tree */
> > vma_start_write(vma);
> > - vma_iter_store(&vmi, vma);
> > - mm->map_count++;
> > + vma_iter_store(vmi, vma);
> > + map->mm->map_count++;
> > vma_link_file(vma);
> >
> > /*
> > * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
> > * call covers the non-merge case.
> > */
> > - khugepaged_enter_vma(vma, vma->vm_flags);
> > + khugepaged_enter_vma(vma, map->flags);
> >
> > file_expanded:
> > - file = vma->vm_file;
> > ksm_add_vma(vma);
>
> I'm wondering if this could go to the "non file expanded" region above. If
> we expanded a vma, it was either in ksm and stays in ksm, or it was not
> eligible and that couldn't have changed by expanding?
We could change this, but the next commit removes the need for this :)
>
> > -expanded:
> > + *vmap = vma;
> > + return 0;
> > +
> > +free_iter_vma:
> > + vma_iter_free(vmi);
> > +free_vma:
> > + vm_area_free(vma);
> > + return error;
> > +}
> > +
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 8/8] mm: defer second attempt at merge on mmap()
2024-10-25 9:43 ` Vlastimil Babka
@ 2024-10-25 10:20 ` Lorenzo Stoakes
0 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-25 10:20 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Andrew Morton, Liam R . Howlett, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
On Fri, Oct 25, 2024 at 11:43:20AM +0200, Vlastimil Babka wrote:
> On 10/23/24 22:38, Lorenzo Stoakes wrote:
> > Rather than trying to merge again when ostensibly allocating a new VMA,
> > instead defer until the VMA is added and attempt to merge the existing
> > range.
> >
> > This way we have no complicated unwinding logic midway through the process
> > of mapping the VMA.
> >
> > In addition, this removes the previous implicit limitation that the VMA
> > could not be the first in the virtual memory address space.
> >
> > It also performs this merge after the final flag adjustments are performed,
> > something that was not done previously and thus might have prevented
> > possibly valid merges in the past.
> >
> > In theory, for this very same reason, we should unconditionally attempt
> > merge here, however this is likely to have a performance impact so it is
> > better to avoid this given the unlikely outcome of a merge.
>
> Maybe just expand the cases where we set map->retry_merge, i.e. in case the
> final flag adjustments do anything?
>
> > The vmg state will already have been reset by the first attempt at a merge
> > so we only need to reset the iterator, set the vma and flags and try again.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Thanks (and for 7/8!)
>
> A nit:
>
> > ---
> > mm/vma.c | 75 ++++++++++++++++++++------------------------------------
> > 1 file changed, 26 insertions(+), 49 deletions(-)
> >
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 065f5e1f65be..c493ecebf394 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -19,6 +19,7 @@ struct mmap_state {
> > struct file *file;
> >
> > unsigned long charged;
> > + bool retry_merge;
> >
> > struct vm_area_struct *prev;
> > struct vm_area_struct *next;
> > @@ -2280,9 +2281,9 @@ static int __mmap_prepare(struct mmap_state *map, struct vma_merge_struct *vmg,
> > return 0;
> > }
> >
> > +
> > static int __mmap_new_file_vma(struct mmap_state *map,
> > - struct vma_merge_struct *vmg,
> > - struct vm_area_struct **vmap, bool *mergedp)
> > + struct vm_area_struct **vmap)
> > {
> > struct vma_iterator *vmi = map->vmi;
> > struct vm_area_struct *vma = *vmap;
> > @@ -2311,37 +2312,11 @@ static int __mmap_new_file_vma(struct mmap_state *map,
> > !(map->flags & VM_MAYWRITE) &&
> > (vma->vm_flags & VM_MAYWRITE));
> >
> > - vma_iter_config(vmi, map->addr, map->end);
> > - /*
> > - * If flags changed after mmap_file(), we should try merge
> > - * vma again as we may succeed this time.
> > - */
> > - if (unlikely(map->flags != vma->vm_flags && map->prev)) {
> > - struct vm_area_struct *merge;
> > -
> > - vmg->flags = vma->vm_flags;
> > - /* If this fails, state is reset ready for a reattempt. */
> > - merge = vma_merge_new_range(vmg);
> > -
> > - if (merge) {
> > - /*
> > - * ->mmap() can change vma->vm_file and fput
> > - * the original file. So fput the vma->vm_file
> > - * here or we would add an extra fput for file
> > - * and cause general protection fault
> > - * ultimately.
> > - */
> > - fput(vma->vm_file);
> > - vm_area_free(vma);
> > - vma = merge;
> > - *mergedp = true;
> > - } else {
> > - vma_iter_config(vmi, map->addr, map->end);
> > - }
> > - }
> > + /* If the flags change (and are mergeable), let's retry later. */
> > + map->retry_merge = vma->vm_flags != map->flags && !(vma->vm_flags & VM_SPECIAL);
> >
> > + vma_iter_config(vmi, map->addr, map->end);
>
> Do we need this still? __mmap_new_vma() did that and nothing changed since
> in the non-error case, AFAICS?
You're right, this change really highlights that, will remove thanks!
>
> > map->flags = vma->vm_flags;
> > - *vmap = vma;
> > return 0;
> > }
> >
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v2 7/8] mm: refactor __mmap_region()
2024-10-25 10:19 ` Lorenzo Stoakes
@ 2024-10-25 10:23 ` Vlastimil Babka
0 siblings, 0 replies; 43+ messages in thread
From: Vlastimil Babka @ 2024-10-25 10:23 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Jann Horn, linux-kernel,
linux-mm, Linus Torvalds, Peter Xu
On 10/25/24 12:19, Lorenzo Stoakes wrote:
> On Fri, Oct 25, 2024 at 10:35:20AM +0200, Vlastimil Babka wrote:
>> >
>> > /* Lock the VMA since it is modified after insertion into VMA tree */
>> > vma_start_write(vma);
>> > - vma_iter_store(&vmi, vma);
>> > - mm->map_count++;
>> > + vma_iter_store(vmi, vma);
>> > + map->mm->map_count++;
>> > vma_link_file(vma);
>> >
>> > /*
>> > * vma_merge_new_range() calls khugepaged_enter_vma() too, the below
>> > * call covers the non-merge case.
>> > */
>> > - khugepaged_enter_vma(vma, vma->vm_flags);
>> > + khugepaged_enter_vma(vma, map->flags);
>> >
>> > file_expanded:
>> > - file = vma->vm_file;
>> > ksm_add_vma(vma);
>>
>> I'm wondering if this could go to the "non file expanded" region above. If
>> we expanded a vma, it was either in ksm and stays in ksm, or it was not
>> eligible and that couldn't have changed by expanding?
>
> We could change this, but the next commit removes the need for this :)
Yeah, no need to change here then.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
@ 2024-10-28 18:29 ` Mark Brown
2024-10-28 18:57 ` Lorenzo Stoakes
0 siblings, 1 reply; 43+ messages in thread
From: Mark Brown @ 2024-10-28 18:29 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
linux-kernel, linux-mm, Linus Torvalds, Peter Xu,
linux-arm-kernel, Catalin Marinas, Will Deacon, Aishwarya TCV
[-- Attachment #1: Type: text/plain, Size: 5141 bytes --]
On Wed, Oct 23, 2024 at 09:38:29PM +0100, Lorenzo Stoakes wrote:
> The mmap_region() function is somewhat terrifying, with spaghetti-like
> control flow and numerous means by which issues can arise and incomplete
> state, memory leaks and other unpleasantness can occur.
Today's pending-fixes is showing a fairly large set of failures in the
arm64 MTE selftests on all the platforms that have MTE (currently just
the software ones). Bisection points at this change which is
0967bf7fbd0e0 in -next which seems plausible but I didn't investigate in
any meaningful detail. There's nothing particularly instructive in the
test logs, just plain reports that the tests failed:
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
# ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
# # FAIL: mmap allocation
# # FAIL: memory allocation
# not ok 19 Check initial tags with shared mapping, sync error mode and mmap memory
# ok 20 Check initial tags with shared mapping, sync error mode and mmap/mprotect memory
# # Totals: pass:18 fail:2 xfail:0 xpass:0 skip:0 error:0
not ok 42 selftests: arm64: check_buffer_fill # exit=1
(and more, mainly on mmap related things). A full log for a sample run
on the FVP can be seen at:
https://lava.sirena.org.uk/scheduler/job/901638#L3693
and one from qemu here:
https://lava.sirena.org.uk/scheduler/job/901630#L3031
Both of these logs include links to filesystem/firmware images and
command lines to run the model.
Bisects converge cleanly (there's some random extra good commits logged
at the start as my tooling feeds test results it already has on hand
between the good and bad commits into the bisect):
# bad: [6560005f01c3c14aab4c2ce35d97b75796d33d81] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
# good: [ea1fda89f5b23734e10c62762990120d5ae23c43] Merge tag 'x86_urgent_for_v6.12_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
# good: [6668610b4d8ce9a3ee3ed61a9471f62fb5f05bf9] ASoC: Intel: sst: Support LPE0F28 ACPI HID
# good: [2feb023110843acce790e9089e72e9a9503d9fa5] regulator: rtq2208: Fix uninitialized use of regulator_config
# good: [0107f28f135231da22a9ad5756bb16bd5cada4d5] ASoC: Intel: bytcr_rt5640: Add DMI quirk for Vexia Edu Atla 10 tablet
# good: [25f00a13dccf8e45441265768de46c8bf58e08f6] spi: spi-fsl-dspi: Fix crash when not using GPIO chip select
# good: [032532f91a1d06d0750f16c49a9698ef5374a68f] ASoC: codecs: rt5640: Always disable IRQs from rt5640_cancel_work()
# good: [d48696b915527b5bcdd207a299aec03fb037eb17] ASoC: Intel: bytcr_rt5640: Add support for non ACPI instantiated codec
# good: [d0ccf760a405d243a49485be0a43bd5b66ed17e2] spi: geni-qcom: Fix boot warning related to pm_runtime and devres
# good: [f2b5b8201b1545ef92e050735e9c768010d497aa] spi: mtk-snfi: fix kerneldoc for mtk_snand_is_page_ops()
# good: [b5a468199b995bd8ee3c26f169a416a181210c9e] spi: stm32: fix missing device mode capability in stm32mp25
git bisect start '6560005f01c3c14aab4c2ce35d97b75796d33d81' 'ea1fda89f5b23734e10c62762990120d5ae23c43' '6668610b4d8ce9a3ee3ed61a9471f62fb5f05bf9' '2feb023110843acce790e9089e72e9a9503d9fa5' '0107f28f135231da22a9ad5756bb16bd5cada4d5' '25f00a13dccf8e45441265768de46c8bf58e08f6' '032532f91a1d06d0750f16c49a9698ef5374a68f' 'd48696b915527b5bcdd207a299aec03fb037eb17' 'd0ccf760a405d243a49485be0a43bd5b66ed17e2' 'f2b5b8201b1545ef92e050735e9c768010d497aa' 'b5a468199b995bd8ee3c26f169a416a181210c9e'
# bad: [6560005f01c3c14aab4c2ce35d97b75796d33d81] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
git bisect bad 6560005f01c3c14aab4c2ce35d97b75796d33d81
# bad: [4a2901b5d394f58cdc60bc25e32c381bb2b83891] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless.git
git bisect bad 4a2901b5d394f58cdc60bc25e32c381bb2b83891
# bad: [4093d34d740447b23a1ea916dabcf902aa767812] Merge branch 'fs-current' of linux-next
git bisect bad 4093d34d740447b23a1ea916dabcf902aa767812
# bad: [0967bf7fbd0e03cee0525035762150a91ba1bb7c] mm: resolve faulty mmap_region() error path behaviour
git bisect bad 0967bf7fbd0e03cee0525035762150a91ba1bb7c
# good: [633e7df6cfdf97f8acf2a59fbfead01e31d0e492] tools: testing: add expand-only mode VMA test
git bisect good 633e7df6cfdf97f8acf2a59fbfead01e31d0e492
# good: [315add1ace71306a7d8518fd417466d938041ff1] mseal: update mseal.rst
git bisect good 315add1ace71306a7d8518fd417466d938041ff1
# good: [bcbb8b25ab80347994e33c358481e65f95f665fd] mm: fix PSWPIN counter for large folios swap-in
git bisect good bcbb8b25ab80347994e33c358481e65f95f665fd
# good: [8438cf67b86bf8c966f32612a7e12b2eb910396b] mm: unconditionally close VMAs on error
git bisect good 8438cf67b86bf8c966f32612a7e12b2eb910396b
# good: [a220e219d89c2d574ad9ffda627575e11334fede] mm: refactor map_deny_write_exec()
git bisect good a220e219d89c2d574ad9ffda627575e11334fede
# first bad commit: [0967bf7fbd0e03cee0525035762150a91ba1bb7c] mm: resolve faulty mmap_region() error path behaviour
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 18:29 ` Mark Brown
@ 2024-10-28 18:57 ` Lorenzo Stoakes
2024-10-28 19:05 ` Linus Torvalds
2024-10-28 20:51 ` Mark Brown
0 siblings, 2 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-28 18:57 UTC (permalink / raw)
To: Mark Brown
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
linux-kernel, linux-mm, Linus Torvalds, Peter Xu,
linux-arm-kernel, Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 06:29:36PM +0000, Mark Brown wrote:
> On Wed, Oct 23, 2024 at 09:38:29PM +0100, Lorenzo Stoakes wrote:
> > The mmap_region() function is somewhat terrifying, with spaghetti-like
> > control flow and numerous means by which issues can arise and incomplete
> > state, memory leaks and other unpleasantness can occur.
>
> Today's pending-fixes is showing a fairly large set of failures in the
> arm64 MTE selftests on all the platforms that have MTE (currently just
> the software ones). Bisection points at this change which is
> 0967bf7fbd0e0 in -next which seems plausible but I didn't investigate in
> any meaningful detail. There's nothing particularly instructive in the
> test logs, just plain reports that the tests failed:
Ugh, yep, ok. Thanks for the report. This is likely because MTE relies on merge
behaviour or the ->mmap() hook in some unfortunate way that we haven't
accounted for here.
Bad time for my arm64 qemu to be broken :)
Would it be possible for you to assist me with investigating this a little as
you have things pretty well set up?
On these memory allocation failures, could you tell me what errno is? Could you
check dmesg for anything strange?
>
> # # FAIL: mmap allocation
Interesting that it MAP_FAIL's though. This could be arch_validate_flags() being
moved around.
Could you do me a further favour then and try a kernel at this commit with:
/* Allow architectures to sanity-check the vm_flags. */
if (!arch_validate_flags(vm_flags))
return -EINVAL;
In mmap_region() commented out?
That and the errno would be hugely useful information, thank you!
Wondering if somehow the driver hook changes the flags such that
arch_validate_flags() passes, but it wouldn't with the original flags.
OK, looking at the source I'm 99% certain it's the move of this check, as arm64 in
its hook for this does:
/* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */
return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
So likely hook on your mapping changes flags to set VM_MTE | VM_MTE_ALLOWED and
expects this to be checked after (ugh).
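To spell the suspicion out, very roughly (a simplified sketch only, not the
actual mmap_region() code, and the helper names are purely illustrative):

/* old order: the f_op->mmap() hook ran before the arch check */
static int old_order(struct file *file, struct vm_area_struct *vma)
{
	int err = call_mmap(file, vma);	/* shmem_mmap() sets VM_MTE_ALLOWED */

	if (err)
		return err;

	/* the check sees VM_MTE | VM_MTE_ALLOWED, so it passes */
	return arch_validate_flags(vma->vm_flags) ? 0 : -EINVAL;
}

/* new order: the arch check only ever sees the caller-derived flags */
static int new_order(struct file *file, vm_flags_t vm_flags)
{
	/* VM_MTE set, VM_MTE_ALLOWED not yet, so -EINVAL, i.e. MAP_FAILED */
	if (!arch_validate_flags(vm_flags))
		return -EINVAL;

	/* ... VMA created, call_mmap() sets VM_MTE_ALLOWED too late ... */
	return 0;
}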
Thanks!
> # # FAIL: memory allocation
> # not ok 17 Check initial tags with private mapping, sync error mode and mmap memory
> # ok 18 Check initial tags with private mapping, sync error mode and mmap/mprotect memory
> # # FAIL: mmap allocation
> # # FAIL: memory allocation
> # not ok 19 Check initial tags with shared mapping, sync error mode and mmap memory
> # ok 20 Check initial tags with shared mapping, sync error mode and mmap/mprotect memory
> # # Totals: pass:18 fail:2 xfail:0 xpass:0 skip:0 error:0
> not ok 42 selftests: arm64: check_buffer_fill # exit=1
>
> (and more, mainly on mmap related things). A full log for a sample run
> on the FVP can be seen at:
>
> https://lava.sirena.org.uk/scheduler/job/901638#L3693
>
> and one from qemu here:
>
> https://lava.sirena.org.uk/scheduler/job/901630#L3031
>
> Both of these logs include links to filesystem/firmware images and
> command lines to run the model.
>
> Bisects converge cleanly (there's some random extra good commits logged
> at the start as my tooling feeds test results it already has on hand
> between the good and bad commits into the bisect):
>
> # bad: [6560005f01c3c14aab4c2ce35d97b75796d33d81] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
> # good: [ea1fda89f5b23734e10c62762990120d5ae23c43] Merge tag 'x86_urgent_for_v6.12_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> # good: [6668610b4d8ce9a3ee3ed61a9471f62fb5f05bf9] ASoC: Intel: sst: Support LPE0F28 ACPI HID
> # good: [2feb023110843acce790e9089e72e9a9503d9fa5] regulator: rtq2208: Fix uninitialized use of regulator_config
> # good: [0107f28f135231da22a9ad5756bb16bd5cada4d5] ASoC: Intel: bytcr_rt5640: Add DMI quirk for Vexia Edu Atla 10 tablet
> # good: [25f00a13dccf8e45441265768de46c8bf58e08f6] spi: spi-fsl-dspi: Fix crash when not using GPIO chip select
> # good: [032532f91a1d06d0750f16c49a9698ef5374a68f] ASoC: codecs: rt5640: Always disable IRQs from rt5640_cancel_work()
> # good: [d48696b915527b5bcdd207a299aec03fb037eb17] ASoC: Intel: bytcr_rt5640: Add support for non ACPI instantiated codec
> # good: [d0ccf760a405d243a49485be0a43bd5b66ed17e2] spi: geni-qcom: Fix boot warning related to pm_runtime and devres
> # good: [f2b5b8201b1545ef92e050735e9c768010d497aa] spi: mtk-snfi: fix kerneldoc for mtk_snand_is_page_ops()
> # good: [b5a468199b995bd8ee3c26f169a416a181210c9e] spi: stm32: fix missing device mode capability in stm32mp25
> git bisect start '6560005f01c3c14aab4c2ce35d97b75796d33d81' 'ea1fda89f5b23734e10c62762990120d5ae23c43' '6668610b4d8ce9a3ee3ed61a9471f62fb5f05bf9' '2feb023110843acce790e9089e72e9a9503d9fa5' '0107f28f135231da22a9ad5756bb16bd5cada4d5' '25f00a13dccf8e45441265768de46c8bf58e08f6' '032532f91a1d06d0750f16c49a9698ef5374a68f' 'd48696b915527b5bcdd207a299aec03fb037eb17' 'd0ccf760a405d243a49485be0a43bd5b66ed17e2' 'f2b5b8201b1545ef92e050735e9c768010d497aa' 'b5a468199b995bd8ee3c26f169a416a181210c9e'
> # bad: [6560005f01c3c14aab4c2ce35d97b75796d33d81] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
> git bisect bad 6560005f01c3c14aab4c2ce35d97b75796d33d81
> # bad: [4a2901b5d394f58cdc60bc25e32c381bb2b83891] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless.git
> git bisect bad 4a2901b5d394f58cdc60bc25e32c381bb2b83891
> # bad: [4093d34d740447b23a1ea916dabcf902aa767812] Merge branch 'fs-current' of linux-next
> git bisect bad 4093d34d740447b23a1ea916dabcf902aa767812
> # bad: [0967bf7fbd0e03cee0525035762150a91ba1bb7c] mm: resolve faulty mmap_region() error path behaviour
> git bisect bad 0967bf7fbd0e03cee0525035762150a91ba1bb7c
> # good: [633e7df6cfdf97f8acf2a59fbfead01e31d0e492] tools: testing: add expand-only mode VMA test
> git bisect good 633e7df6cfdf97f8acf2a59fbfead01e31d0e492
> # good: [315add1ace71306a7d8518fd417466d938041ff1] mseal: update mseal.rst
> git bisect good 315add1ace71306a7d8518fd417466d938041ff1
> # good: [bcbb8b25ab80347994e33c358481e65f95f665fd] mm: fix PSWPIN counter for large folios swap-in
> git bisect good bcbb8b25ab80347994e33c358481e65f95f665fd
> # good: [8438cf67b86bf8c966f32612a7e12b2eb910396b] mm: unconditionally close VMAs on error
> git bisect good 8438cf67b86bf8c966f32612a7e12b2eb910396b
> # good: [a220e219d89c2d574ad9ffda627575e11334fede] mm: refactor map_deny_write_exec()
> git bisect good a220e219d89c2d574ad9ffda627575e11334fede
> # first bad commit: [0967bf7fbd0e03cee0525035762150a91ba1bb7c] mm: resolve faulty mmap_region() error path behaviour
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 18:57 ` Lorenzo Stoakes
@ 2024-10-28 19:05 ` Linus Torvalds
2024-10-28 19:14 ` Lorenzo Stoakes
2024-10-28 20:51 ` Mark Brown
1 sibling, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-10-28 19:05 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Mark Brown, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, 28 Oct 2024 at 08:57, Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> So likely hook on your mapping changes flags to set VM_MTE | VM_MTE_ALLOWED and
> expects this to be checked after (ugh).
Gaah. Yes. mm/shmem.c: shmem_mmap() does
/* arm64 - allow memory tagging on RAM-based files */
vm_flags_set(vma, VM_MTE_ALLOWED);
and while I found the equivalent hack for the VM_SPARC_ADI case, I
hadn't noticed that MTE thing.
How very annoying.
So the arch_validate_flags() case does need to be done after the ->mmap() call.
How about just finalizing everything, and then doing a regular
munmap() afterwards and returning an error (all still holding the mmap
semaphore, of course).
That still avoids the whole "partially completed mmap" case.
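IOW something along these lines, just as a sketch of the idea (not tested or
even compiled, and the exact call site would need care):

	addr = __mmap_region(file, addr, len, vm_flags, pgoff, uf);
	if (IS_ERR_VALUE(addr))
		return addr;

	vma = find_vma(current->mm, addr);
	/* ->mmap() may have updated the flags (eg shmem setting VM_MTE_ALLOWED) */
	if (vma && !arch_validate_flags(vma->vm_flags)) {
		/* the mapping is fully installed, so a plain munmap tears it down */
		do_munmap(current->mm, addr, len, uf);
		return -EINVAL;
	}

	return addr;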
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 19:05 ` Linus Torvalds
@ 2024-10-28 19:14 ` Lorenzo Stoakes
2024-10-28 19:50 ` Liam R. Howlett
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-28 19:14 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mark Brown, Andrew Morton, Liam R . Howlett, Vlastimil Babka,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 09:05:44AM -1000, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 08:57, Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > So likely hook on your mapping changes flags to set VM_MTE | VM_MTE_ALLOWED and
> > expects this to be checked after (ugh).
>
> Gaah. Yes. mm/shmem.c: shmem_mmap() does
>
> /* arm64 - allow memory tagging on RAM-based files */
> vm_flags_set(vma, VM_MTE_ALLOWED);
>
> and while I found the equivalent hack for the VM_SPARC_ADI case, I
> hadn't noticed that MTE thing.
>
> How very annoying.
>
> So the arch_validate_flags() case does need to be done after the ->mmap() call.
>
> How about just finalizing everything, and then doing a regular
> munmap() afterwards and returning an error (all still holding the mmap
> semaphore, of course).
>
> That still avoids the whole "partially completed mmap" case.
>
> Linus
Yeah I was thinking the same... just bite the bullet, go through the whole damn
process and revert if arch_validate_flags() chokes. It also removes the ugly
#ifdef CONFIG_SPARC64 hack...
This will literally only be applicable for these two cases and (hopefully) most
of the time you'd not fail it.
I mean by then it'll be added into the rmap and such but nothing will be
populated yet and we shouldn't be able to fault as vma_start_write() should have
incremented the vma lock seqnum.
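(Roughly, from memory and simplified, the per-VMA lock interaction I mean is
something like the below, so lockless faults should bail out and fall back to
the mmap_lock we're still holding:)

	/* done under mmap_lock (write) when the VMA is about to be modified */
	vma_start_write(vma);		/* records mm->mm_lock_seq in the VMA */

	/* lockless fault path (lock_vma_under_rcu()), roughly: */
	if (vma->vm_lock_seq == mm->mm_lock_seq)
		goto fallback;		/* take the mmap_lock path instead */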
Any issues from the RCU visibility stuff Liam?
Any security problems Jann...?
It'd suck to have to bring back a partial complete case. Though I do note we
handle errors from mmap_file() ok so we could still potentially handle that
there, but would sort of semi-undo some of the point of the series.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 19:14 ` Lorenzo Stoakes
@ 2024-10-28 19:50 ` Liam R. Howlett
2024-10-28 20:00 ` Liam R. Howlett
0 siblings, 1 reply; 43+ messages in thread
From: Liam R. Howlett @ 2024-10-28 19:50 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Linus Torvalds, Mark Brown, Andrew Morton, Vlastimil Babka,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241028 15:14]:
> On Mon, Oct 28, 2024 at 09:05:44AM -1000, Linus Torvalds wrote:
> > On Mon, 28 Oct 2024 at 08:57, Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > So likely hook on your mapping changes flags to set VM_MTE | VM_MTE_ALLOWED and
> > > expects this to be checked after (ugh).
> >
> > Gaah. Yes. mm/shmem.c: shmem_mmap() does
> >
> > /* arm64 - allow memory tagging on RAM-based files */
> > vm_flags_set(vma, VM_MTE_ALLOWED);
> >
> > and while I found the equivalent hack for the VM_SPARC_ADI case, I
> > hadn't noticed that MTE thing.
> >
> > How very annoying.
> >
> > So the arch_validate_flags() case does need to be done after the ->mmap() call.
> >
> > How about just finalizing everything, and then doing a regular
> > munmap() afterwards and returning an error (all still holding the mmap
> > semaphore, of course).
> >
> > That still avoids the whole "partially completed mmap" case.
> >
> > Linus
>
> Yeah I was thinking the same... just bite the bullet, go through the whole damn
> process and revert if arch_validate_flags() chokes. It also removes the ugly
> #ifdef CONFIG_SPARC64 hack...
>
> This will literally only be applicable for these two cases and (hopefully) most
> of the time you'd not fail it.
>
> I mean by then it'll be added into the rmap and such but nothing will be
> populated yet and we shouldn't be able to fault as vma_start_write() should have
> incremented the vma lock seqnum.
>
> Any issues from the RCU visibility stuff Liam?
It is probably fine? We would see a mapping appear then disappear.
We'd have a (benign) race with rmap for truncating the PTEs (but it's
safe). Page faults would be stopped though.
Unfortunately, we'd have to write to the vma tree so that we could...
write to the vma tree. We'd have to somehow ensure munmap() is done
with a gfp flag to ensure no failures as well...
Maybe we should just call close on the vma again (and do whatever
call_mmap() needs to undo)?
>
> Any security problems Jann...?
>
> It'd suck to have to bring back a partial complete case. Though I do note we
> handle errors from mmap_file() ok so we could still potentially handle that
> there, but would sort of semi-undo some of the point of the series.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 19:50 ` Liam R. Howlett
@ 2024-10-28 20:00 ` Liam R. Howlett
2024-10-28 20:17 ` Lorenzo Stoakes
0 siblings, 1 reply; 43+ messages in thread
From: Liam R. Howlett @ 2024-10-28 20:00 UTC (permalink / raw)
To: Lorenzo Stoakes, Linus Torvalds, Mark Brown, Andrew Morton,
Vlastimil Babka, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Catalin Marinas, Will Deacon, Aishwarya TCV
* Liam R. Howlett <Liam.Howlett@oracle.com> [241028 15:50]:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241028 15:14]:
> > On Mon, Oct 28, 2024 at 09:05:44AM -1000, Linus Torvalds wrote:
> > > On Mon, 28 Oct 2024 at 08:57, Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > So likely hook on your mapping changes flags to set VM_MTE | VM_MTE_ALLOWED and
> > > > expects this to be checked after (ugh).
> > >
> > > Gaah. Yes. mm/shmem.c: shmem_mmap() does
> > >
> > > /* arm64 - allow memory tagging on RAM-based files */
> > > vm_flags_set(vma, VM_MTE_ALLOWED);
> > >
> > > and while I found the equivalent hack for the VM_SPARC_ADI case, I
> > > hadn't noticed that MTE thing.
> > >
> > > How very annoying.
> > >
> > > So the arch_validate_flags() case does need to be done after the ->mmap() call.
> > >
> > > How about just finalizing everything, and then doing a regular
> > > munmap() afterwards and returning an error (all still holding the mmap
> > > semaphore, of course).
> > >
> > > That still avoids the whole "partially completed mmap" case.
> > >
> > > Linus
> >
> > Yeah I was thinking the same... just bite the bullet, go through the whole damn
> > process and revert if arch_validate_flags() chokes. It also removes the ugly
> > #ifdef CONFIG_SPARC64 hack...
> >
> > This will literally only be applicable for these two cases and (hopefully) most
> > of the time you'd not fail it.
> >
> > I mean by then it'll be added into the rmap and such but nothing will be
> > populated yet and we shouldn't be able to fault as vma_start_write() should have
> > incremented the vma lock seqnum.
> >
> > Any issues from the RCU visibility stuff Liam?
>
> It is probably fine? We would see a mapping appear then disappear.
> We'd have a (benign) race with rmap for truncating the PTEs (but it's
> safe). Page faults would be stopped though.
>
> Unfortunately, we'd have to write to the vma tree so that we could...
> write to the vma tree. We'd have to somehow ensure munmap() is done
> with a gfp flag to ensure no failures as well...
>
> Maybe we should just call close on the vma again (and do whatever
> call_mmap() needs to undo)?
I take it back, that won't work.
>
> >
> > Any security problems Jann...?
> >
> > It'd suck to have to bring back a partial complete case. Though I do note we
> > handle errors from mmap_file() ok so we could still potentially handle that
> > there, but would sort of semi-undo some of the point of the series.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 20:00 ` Liam R. Howlett
@ 2024-10-28 20:17 ` Lorenzo Stoakes
2024-10-28 20:22 ` Linus Torvalds
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-28 20:17 UTC (permalink / raw)
To: Liam R. Howlett, Linus Torvalds, Mark Brown, Andrew Morton,
Vlastimil Babka, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 04:00:29PM -0400, Liam R. Howlett wrote:
> * Liam R. Howlett <Liam.Howlett@oracle.com> [241028 15:50]:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241028 15:14]:
> > > On Mon, Oct 28, 2024 at 09:05:44AM -1000, Linus Torvalds wrote:
> > > > On Mon, 28 Oct 2024 at 08:57, Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > So likely hook on your mapping changes flags to set VM_MTE | VM_MTE_ALLOWED and
> > > > > expects this to be checked after (ugh).
> > > >
> > > > Gaah. Yes. mm/shmem.c: shmem_mmap() does
> > > >
> > > > /* arm64 - allow memory tagging on RAM-based files */
> > > > vm_flags_set(vma, VM_MTE_ALLOWED);
> > > >
> > > > and while I found the equivalent hack for the VM_SPARC_ADI case, I
> > > > hadn't noticed that MTE thing.
> > > >
> > > > How very annoying.
> > > >
> > > > So the arch_validate_flags() case does need to be done after the ->mmap() call.
> > > >
> > > > How about just finalizing everything, and then doing a regular
> > > > munmap() afterwards and returning an error (all still holding the mmap
> > > > semaphore, of course).
> > > >
> > > > That still avoids the whole "partially completed mmap" case.
> > > >
> > > > Linus
> > >
> > > Yeah I was thinking the same... just bite the bullet, go through the whole damn
> > > process and revert if arch_validate_flags() chokes. It also removes the ugly
> > > #ifdef CONFIG_SPARC64 hack...
> > >
> > > This will literally only be applicable for these two cases and (hopefully) most
> > > of the time you'd not fail it.
> > >
> > > I mean by then it'll be added into the rmap and such but nothing will be
> > > populated yet and we shouldn't be able to fault as vma_start_write() should have
> > > incremented the vma lock seqnum.
> > >
> > > Any issues from the RCU visibility stuff Liam?
> >
> > It is probably fine? We would see a mapping appear then disappear.
> > We'd have a (benign) race with rmap for truncating the PTEs (but it's
> > safe). Page faults would be stopped though.
> >
> > Unfortunately, we'd have to write to the vma tree so that we could...
> > write to the vma tree. We'd have to somehow ensure munmap() is done
> > with a gfp flag to ensure no failures as well...
> >
> > Maybe we should just call close on the vma again (and do whatever
> > call_mmap() needs to undo)?
>
> I take it back, that won't work.
>
> >
> > >
> > > Any security problems Jann...?
> > >
> > > It'd suck to have to bring back a partial complete case. Though I do note we
> > > handle errors from mmap_file() ok so we could still potentially handle that
> > > there, but would sort of semi-undo some of the point of the series.
I'm genuinely not opposed to a horrible, awful:
#ifdef CONFIG_ARM64
if (file && file->f_ops == shmem_file_operations)
vm_flags |= VM_MTE_ALLOWED;
#endif
Early in the operation prior to the arch_validate_flags() check.
Just to get this over the finish line (original idea credit to Vlastimil, insane
ugliness credit to me).
Could be in a __arch_workarounds() type function heavily commented...
I mean this is pretty gross but...
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 20:17 ` Lorenzo Stoakes
@ 2024-10-28 20:22 ` Linus Torvalds
2024-10-28 20:43 ` Lorenzo Stoakes
2024-10-28 21:00 ` Vlastimil Babka
0 siblings, 2 replies; 43+ messages in thread
From: Linus Torvalds @ 2024-10-28 20:22 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Liam R. Howlett, Mark Brown, Andrew Morton, Vlastimil Babka,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, 28 Oct 2024 at 10:18, Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> I'm genuinely not opposed to a horrible, awful:
>
> #ifdef CONFIG_ARM64
> if (file && file->f_ops == shmem_file_operations)
> vm_flags |= VM_MTE_ALLOWED;
> #endif
>
> Early in the operation prior to the arch_validate_flags() check.
I would just put it inside the arm64 code itself.
IOW, get rid of the VM_MTE_ALLOWED flag entirely, and just make the
arm64 arch_validate_flags() code do something like
if (flags & VM_MTE) {
if (file->f_ops != shmem_file_operations)
return false;
}
and be done with it.
Considering that we only have that horrendous arch_validate_flags()
for two architectures, and that they both just have magical special
cases for MTE-like behavior, I do think that just making it be a hack
inside those functions is the way to go.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 20:22 ` Linus Torvalds
@ 2024-10-28 20:43 ` Lorenzo Stoakes
2024-10-28 21:04 ` Liam R. Howlett
2024-10-28 21:05 ` Mark Brown
2024-10-28 21:00 ` Vlastimil Babka
1 sibling, 2 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-28 20:43 UTC (permalink / raw)
To: Linus Torvalds
Cc: Liam R. Howlett, Mark Brown, Andrew Morton, Vlastimil Babka,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 10:22:32AM -1000, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 10:18, Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > I'm genuinely not opposed to a horrible, awful:
> >
> > #ifdef CONFIG_ARM64
> > if (file && file->f_ops == shmem_file_operations)
> > vm_flags |= VM_MTE_ALLOWED;
> > #endif
> >
> > Early in the operation prior to the arch_validate_flags() check.
>
> I would just put it inside the arm64 code itself.
>
> IOW, get rid of the VM_MTE_ALLOWED flag entirely, and just make the
> arm64 arch_validate_flags() code do something like
>
> if (flags & VM_MTE) {
> if (file->f_ops != shmem_file_operations)
> return false;
> }
>
> and be done with it.
>
> Considering that we only have that horrendous arch_validate_flags()
> for two architectures, and that they both just have magical special
> cases for MTE-like behavior, I do think that just making it be a hack
> inside those functions is the way to go.
>
> Linus
Ah yeah makes sense.
FWIW I just made a fix -for now- which implements it in the hideous way,
shown below.
We can maybe take that as a fix-patch for now and I can look at replacing
this tomorrow with something as you suggest properly.
My only concern is that arm people might not be happy and we get some hold
up here...
Thanks, Lorenzo
----8<----
From fb6c15c74ba0db57f18b08fc6d1e901676f25bf6 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Mon, 28 Oct 2024 20:36:49 +0000
Subject: [PATCH] mm: account for MTE in arm64 on mmap_region() operation
Correctly account for MTE on mmap_region(). We need to check this ahead of
the operation; the shmem mmap hook was doing it, but at a point where a
failure would mean we'd have to tear down a partially installed VMA.
Avoid all this by adding a function to specifically handle this case.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
mm/mmap.c | 20 ++++++++++++++++++++
mm/shmem.c | 3 ---
2 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 8462de1ee583..83afa1ebfd75 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1575,6 +1575,24 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
return error;
}
+/*
+ * We check VMA flag validity early in the mmap() process, however this can
+ * cause issues for arm64 when using MTE, which requires that it be used with
+ * shmem and in this instance and only then is VM_MTE_ALLOWED set permitting
+ * this operation.
+ *
+ * To avoid having to tear down a partially complete mapping we do this ahead of
+ * time.
+ */
+static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
+{
+ if (!IS_ENABLED(CONFIG_ARM64))
+ return vm_flags;
+
+ if (shmem_file(file))
+ return vm_flags | VM_MTE_ALLOWED;
+}
+
unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
struct list_head *uf)
@@ -1586,6 +1604,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (map_deny_write_exec(vm_flags, vm_flags))
return -EACCES;
+ vm_flags = arch_adjust_flags(file, vm_flags);
+
/* Allow architectures to sanity-check the vm_flags. */
if (!arch_validate_flags(vm_flags))
return -EINVAL;
diff --git a/mm/shmem.c b/mm/shmem.c
index 4ba1d00fabda..e87f5d6799a7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
if (ret)
return ret;
- /* arm64 - allow memory tagging on RAM-based files */
- vm_flags_set(vma, VM_MTE_ALLOWED);
-
file_accessed(file);
/* This is anonymous shared memory if it is unlinked at the time of mmap */
if (inode->i_nlink)
--
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 18:57 ` Lorenzo Stoakes
2024-10-28 19:05 ` Linus Torvalds
@ 2024-10-28 20:51 ` Mark Brown
1 sibling, 0 replies; 43+ messages in thread
From: Mark Brown @ 2024-10-28 20:51 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Andrew Morton, Liam R . Howlett, Vlastimil Babka, Jann Horn,
linux-kernel, linux-mm, Linus Torvalds, Peter Xu,
linux-arm-kernel, Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 06:57:25PM +0000, Lorenzo Stoakes wrote:
> On Mon, Oct 28, 2024 at 06:29:36PM +0000, Mark Brown wrote:
> > any meaningful detail. There's nothing particularly instructive in the
> > test logs, just plain reports that the tests failed:
> On these memory allocation failures, could you tell me what errno is? Could you
> check dmesg for anything strange?
Looks like this is mostly figured out already but JFTR:
As I said in the report, there's nothing in the logs that I noticed;
anything I've missed should be in the linked logs. The errnos I'm
seeing are all:
# mmap(): Invalid argument (22)
> > # # FAIL: mmap allocation
> Interesting that it MAP_FAIL's though. This could be arch_validate_flags() being
> moved around.
> Could you do me a further favour then and try a kernel at this commit with:
> /* Allow architectures to sanity-check the vm_flags. */
> if (!arch_validate_flags(vm_flags))
> return -EINVAL;
> In mmap_region() commented out?
Unsurprisingly, given the above and the rest of the thread, commenting out
that check causes the affected tests to pass; I didn't check for any
additional impacts.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 20:22 ` Linus Torvalds
2024-10-28 20:43 ` Lorenzo Stoakes
@ 2024-10-28 21:00 ` Vlastimil Babka
2024-10-28 21:19 ` Linus Torvalds
1 sibling, 1 reply; 43+ messages in thread
From: Vlastimil Babka @ 2024-10-28 21:00 UTC (permalink / raw)
To: Linus Torvalds, Lorenzo Stoakes
Cc: Liam R. Howlett, Mark Brown, Andrew Morton, Jann Horn,
linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On 10/28/24 21:22, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 10:18, Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> I'm genuinely not opposed to a horrible, awful:
>>
>> #ifdef CONFIG_ARM64
>> if (file && file->f_ops == shmem_file_operations)
>> vm_flags |= VM_MTE_ALLOWED;
>> #endif
>>
>> Early in the operation prior to the arch_validate_flags() check.
>
> I would just put it inside the arm64 code itself.
>
> IOW, get rid of the VM_MTE_ALLOWED flag entirely, and just make the
> arm64 arch_validate_flags() code do something like
>
> if (flags & VM_MTE) {
> if (file->f_ops != shmem_file_operations)
> return false;
> }
>
> and be done with it.
VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits():
if (system_supports_mte() && (flags & MAP_ANONYMOUS))
return VM_MTE_ALLOWED;
And there's also this in arch/arm64/include/asm/page.h
#define VM_DATA_DEFAULT_FLAGS (VM_DATA_FLAGS_TSK_EXEC | VM_MTE_ALLOWED)
So it would need to all be considered in the validation if we wanted to
replace VM_MTE_ALLOWED completely?
> Considering that we only have that horrendous arch_validate_flags()
> for two architectures, and that they both just have magical special
> cases for MTE-like behavior, I do think that just making it be a hack
> inside those functions is the way to go.
>
> Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 20:43 ` Lorenzo Stoakes
@ 2024-10-28 21:04 ` Liam R. Howlett
2024-10-28 21:05 ` Mark Brown
1 sibling, 0 replies; 43+ messages in thread
From: Liam R. Howlett @ 2024-10-28 21:04 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Linus Torvalds, Mark Brown, Andrew Morton, Vlastimil Babka,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241028 16:43]:
> On Mon, Oct 28, 2024 at 10:22:32AM -1000, Linus Torvalds wrote:
> > On Mon, 28 Oct 2024 at 10:18, Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > I'm genuinely not opposed to a horrible, awful:
> > >
> > > #ifdef CONFIG_ARM64
> > > if (file && file->f_ops == shmem_file_operations)
> > > vm_flags |= VM_MTE_ALLOWED;
> > > #endif
> > >
> > > Early in the operation prior to the arch_validate_flags() check.
> >
> > I would just put it inside the arm64 code itself.
> >
> > IOW, get rid of the VM_MTE_ALLOWED flag entirely, and just make the
> > arm64 arch_validate_flags() code do something like
> >
> > if (flags & VM_MTE) {
> > if (file->f_ops != shmem_file_operations)
> > return false;
> > }
> >
> > and be done with it.
> >
> > Considering that we only have that horrendous arch_validate_flags()
> > for two architectures, and that they both just have magical special
> > cases for MTE-like behavior, I do think that just making it be a hack
> > inside those functions is the way to go.
> >
> > Linus
>
> Ah yeah makes sense.
>
> FWIW I just made a fix -for now- which implements it in the hideous way,
> shown below.
>
> We can maybe take that as a fix-patch for now and I can look at replacing
> this tomorrow with something as you suggest properly.
>
> My only concern is that arm people might not be happy and we get some hold
> up here...
>
> Thanks, Lorenzo
>
>
> ----8<----
> From fb6c15c74ba0db57f18b08fc6d1e901676f25bf6 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Mon, 28 Oct 2024 20:36:49 +0000
> Subject: [PATCH] mm: account for MTE in arm64 on mmap_region() operation
>
> Correctly account for MTE on mmap_region(). We need to check this ahead of
> the operation; the shmem mmap hook was doing it, but at a point where a
> failure would mean we'd have to tear down a partially installed VMA.
>
> Avoid all this by adding a function to specifically handle this case.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/mmap.c | 20 ++++++++++++++++++++
> mm/shmem.c | 3 ---
> 2 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8462de1ee583..83afa1ebfd75 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1575,6 +1575,24 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> return error;
> }
>
> +/*
> + * We check VMA flag validity early in the mmap() process, however this can
> + * cause issues for arm64 when using MTE, which requires that it be used with
> + * shmem and in this instance and only then is VM_MTE_ALLOWED set permitting
> + * this operation.
> + *
> + * To avoid having to tear down a partially complete mapping we do this ahead of
> + * time.
> + */
> +static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
Is it worth adding an inline?
> +{
> + if (!IS_ENABLED(CONFIG_ARM64))
CONFIG_ARM64_MTE... otherwise VM_MTE_ALLOWED is 0, so it really doesn't
matter I guess.
> + return vm_flags;
> +
> + if (shmem_file(file))
> + return vm_flags | VM_MTE_ALLOWED;
Would if (VM_MTE_ALLOWED && shmem_file(file)) allow the compiler to
remove some of this? Also probably doesn't matter much.
> +}
> +
> unsigned long mmap_region(struct file *file, unsigned long addr,
> unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> struct list_head *uf)
> @@ -1586,6 +1604,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> if (map_deny_write_exec(vm_flags, vm_flags))
> return -EACCES;
>
> + vm_flags = arch_adjust_flags(file, vm_flags);
> +
> /* Allow architectures to sanity-check the vm_flags. */
> if (!arch_validate_flags(vm_flags))
> return -EINVAL;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4ba1d00fabda..e87f5d6799a7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> if (ret)
> return ret;
>
> - /* arm64 - allow memory tagging on RAM-based files */
> - vm_flags_set(vma, VM_MTE_ALLOWED);
> -
> file_accessed(file);
> /* This is anonymous shared memory if it is unlinked at the time of mmap */
> if (inode->i_nlink)
> --
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 20:43 ` Lorenzo Stoakes
2024-10-28 21:04 ` Liam R. Howlett
@ 2024-10-28 21:05 ` Mark Brown
2024-10-28 21:28 ` Lorenzo Stoakes
1 sibling, 1 reply; 43+ messages in thread
From: Mark Brown @ 2024-10-28 21:05 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Linus Torvalds, Liam R. Howlett, Andrew Morton, Vlastimil Babka,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 08:43:08PM +0000, Lorenzo Stoakes wrote:
> +/*
> + * We check VMA flag validity early in the mmap() process, however this can
> + * cause issues for arm64 when using MTE, which requires that it be used with
> + * shmem and in this instance and only then is VM_MTE_ALLOWED set permitting
> + * this operation.
> + *
> + * To avoid having to tear down a partially complete mapping we do this ahead of
> + * time.
> + */
> +static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
> +{
> + if (!IS_ENABLED(CONFIG_ARM64))
> + return vm_flags;
> +
> + if (shmem_file(file))
> + return vm_flags | VM_MTE_ALLOWED;
> +}
This doesn't build:
mm/mmap.c:1595:1: error: non-void function does not return a value in all control paths [-Werror,-Wreturn-type]
1595 | }
| ^
with that corrected:
diff --git a/mm/mmap.c b/mm/mmap.c
index d1ab4301c671..cea051c5fef3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1587,11 +1587,10 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
*/
static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
{
- if (!IS_ENABLED(CONFIG_ARM64))
- return vm_flags;
+ if (IS_ENABLED(CONFIG_ARM64) && shmem_file(file))
+ vm_flags |= VM_MTE_ALLOWED;
- if (shmem_file(file))
- return vm_flags | VM_MTE_ALLOWED;
+ return vm_flags;
}
unsigned long mmap_region(struct file *file, unsigned long addr,
the relevant tests all pass for me.
Tested-by: Mark Brown <broonie@kernel.org>
I'd have expected arch_adjust_flags() to be something overridden by the
arch headers (probably like arch_calc_vm_prot_bits() and friends), but
if this is just a short-lived fix it's probably not worth the trouble.
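Something like the below is what I had in mind, following that pattern (just
a sketch, I've not tried to build it):

/* include/linux/mman.h: generic fallback */
#ifndef arch_adjust_flags
static inline vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
{
	return vm_flags;
}
#endif

/* arch/arm64/include/asm/mman.h: arm64 override */
static inline vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
{
	if (system_supports_mte() && shmem_file(file))
		vm_flags |= VM_MTE_ALLOWED;
	return vm_flags;
}
#define arch_adjust_flags arch_adjust_flags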
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 21:00 ` Vlastimil Babka
@ 2024-10-28 21:19 ` Linus Torvalds
2024-10-28 21:28 ` Vlastimil Babka
0 siblings, 1 reply; 43+ messages in thread
From: Linus Torvalds @ 2024-10-28 21:19 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Lorenzo Stoakes, Liam R. Howlett, Mark Brown, Andrew Morton,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, 28 Oct 2024 at 11:00, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits():
>
> if (system_supports_mte() && (flags & MAP_ANONYMOUS))
> return VM_MTE_ALLOWED;
Yeah, but that should just move into arch_validate_flags() too.
There's no reason why that's done in a separate place.
Linus
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 21:05 ` Mark Brown
@ 2024-10-28 21:28 ` Lorenzo Stoakes
0 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-28 21:28 UTC (permalink / raw)
To: Mark Brown
Cc: Linus Torvalds, Liam R. Howlett, Andrew Morton, Vlastimil Babka,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 09:05:33PM +0000, Mark Brown wrote:
> On Mon, Oct 28, 2024 at 08:43:08PM +0000, Lorenzo Stoakes wrote:
>
> > +/*
> > + * We check VMA flag validity early in the mmap() process, however this can
> > + * cause issues for arm64 when using MTE, which requires that it be used with
> > + * shmem and in this instance and only then is VM_MTE_ALLOWED set permitting
> > + * this operation.
> > + *
> > + * To avoid having to tear down a partially complete mapping we do this ahead of
> > + * time.
> > + */
> > +static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
> > +{
> > + if (!IS_ENABLED(CONFIG_ARM64))
> > + return vm_flags;
> > +
> > + if (shmem_file(file))
> > + return vm_flags | VM_MTE_ALLOWED;
> > +}
>
> This doesn't build:
>
> mm/mmap.c:1595:1: error: non-void function does not return a value in all control paths [-Werror,-Wreturn-type]
> 1595 | }
> | ^
Doh that'll teach me for rushing this...
>
> with that corrected:
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d1ab4301c671..cea051c5fef3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1587,11 +1587,10 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> */
> static vm_flags_t arch_adjust_flags(struct file *file, vm_flags_t vm_flags)
> {
> - if (!IS_ENABLED(CONFIG_ARM64))
> - return vm_flags;
> + if (IS_ENABLED(CONFIG_ARM64) && shmem_file(file))
> + vm_flags |= VM_MTE_ALLOWED;
>
> - if (shmem_file(file))
> - return vm_flags | VM_MTE_ALLOWED;
> + return vm_flags;
> }
>
> unsigned long mmap_region(struct file *file, unsigned long addr,
>
> the relevant tests all pass for me.
>
> Tested-by: Mark Brown <broonie@kernel.org>
Thanks!
>
> I'd have expected arch_adjust_flags() to be something overridden by the
> arch headers (probably like arch_calc_vm_prot_bits() and friends), but
> if this is just a short-lived fix it's probably not worth the trouble.
Yeah this is just a sample solution that I had put together when Linus
suggested a sensible alternative which I'll code up...
Good to confirm this is definitely the issue, thanks for testing!
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 21:19 ` Linus Torvalds
@ 2024-10-28 21:28 ` Vlastimil Babka
2024-10-28 22:14 ` Lorenzo Stoakes
0 siblings, 1 reply; 43+ messages in thread
From: Vlastimil Babka @ 2024-10-28 21:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Lorenzo Stoakes, Liam R. Howlett, Mark Brown, Andrew Morton,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On 10/28/24 22:19, Linus Torvalds wrote:
> On Mon, 28 Oct 2024 at 11:00, Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits():
>>
>> if (system_supports_mte() && (flags & MAP_ANONYMOUS))
>> return VM_MTE_ALLOWED;
>
> Yeah, but that should just move into arch_validate_flags() too.
> There's no reason why that's done in a separate place.
>
> Linus
Right, and VM_DATA_DEFAULT_FLAGS is only used in do_brk_flags() which is
also an anonymous VMA, and it doesn't perform arch_validate_flags() anyway.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 21:28 ` Vlastimil Babka
@ 2024-10-28 22:14 ` Lorenzo Stoakes
2024-10-29 7:50 ` Vlastimil Babka
` (2 more replies)
0 siblings, 3 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-28 22:14 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Linus Torvalds, Liam R. Howlett, Mark Brown, Andrew Morton,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 10:28:47PM +0100, Vlastimil Babka wrote:
> On 10/28/24 22:19, Linus Torvalds wrote:
> > On Mon, 28 Oct 2024 at 11:00, Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits():
> >>
> >> if (system_supports_mte() && (flags & MAP_ANONYMOUS))
> >> return VM_MTE_ALLOWED;
> >
> > Yeah, but that should just move into arch_validate_flags() too.
> > There's no reason why that's done in a separate place.
> >
> > Linus
>
> Right, and VM_DATA_DEFAULT_FLAGS is only used in do_brk_flags() which is
> also an anonymous VMA, and it doesn't perform arch_validate_flags() anyway.
Unfortunately we can't do this and have everything work entirely
consistently.
This is because, while adding a file parameter to arch_validate_flags()
lets us do this shmem check, we can't be sure MAP_ANON was set and we do
not have access to this information at the point of the
arch_validate_flags() check.
We could check file == NULL, but this then excludes the MAP_ANON |
MAP_HUGETLB case which is (probably accidentally) permitted by
arch_calc_vm_flag_bits() and for the purposes of this fix we probably just
want to keep behaviour identical.
We could alternatively check the file is shmem _or_ hugetlb but this would
amount to a change in behaviour and not sure we want to go there.
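For illustration, the "shmem or hugetlb" variant would be something like the
below (a sketch only; this is *not* what the patch below does, as it would
change behaviour for MAP_SHARED hugetlb):

static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags)
{
	if (!system_supports_mte())
		return true;

	if (!(vm_flags & VM_MTE))
		return true;

	/* MAP_ANON mappings get this via arch_calc_vm_flag_bits(). */
	if (vm_flags & VM_MTE_ALLOWED)
		return true;

	/* RAM-backed files: shmem as today, plus MAP_SHARED hugetlb. */
	return shmem_file(file) || is_file_hugepages(file);
}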
Anyway I attach a patch that does what you suggest, I actually compiled
this one (!) but don't have a system set up to test it (Mark?)
If we bring this in it should probably be a separate commit...
----8<----
From 247003cd2a4b5f4fc2dac97f5ef7e473a47f4324 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Mon, 28 Oct 2024 22:05:44 +0000
Subject: [PATCH] mm: perform MTE check within arm64 hook entirely
It doesn't make sense to have shmem explicitly check this one arch-specific
case; it is arch-specific, so the arch should handle it. We know shmem is a
case in which we want to permit MTE, so simply check for this directly.
This also fixes the issue with checking arch_validate_flags() early, which
would otherwise break mmap_region().
In order to implement this we must pass a file pointer, and additionally
update the sparc code to accept this parameter too.
We'd ideally like to have eliminated the arch_calc_vm_flag_bits() case, but
we risk inadvertently changing behaviour as we do not have mmap() flags
available at the point of the arch_validate_flags() check and a MAP_ANON |
MAP_HUGETLB case would be accepted for MTE currently, but a MAP_SHARED |
MAP_HUGETLB would not.
This is likely an oversight but we want to try to keep behaviour identical
to before in this patch.
So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
MAP_ANON.
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
arch/arm64/include/asm/mman.h | 18 ++++++++++++++----
arch/sparc/include/asm/mman.h | 5 +++--
include/linux/mman.h | 2 +-
mm/mmap.c | 4 ++--
mm/mprotect.c | 2 +-
mm/shmem.c | 3 ---
6 files changed, 21 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index 9e39217b4afb..44b7c8a1dd67 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -6,7 +6,9 @@
#ifndef BUILD_VDSO
#include <linux/compiler.h>
+#include <linux/fs.h>
#include <linux/types.h>
+#include <linux/shmem_fs.h>
static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
unsigned long pkey)
@@ -60,15 +62,23 @@ static inline bool arch_validate_prot(unsigned long prot,
}
#define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr)
-static inline bool arch_validate_flags(unsigned long vm_flags)
+static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags)
{
if (!system_supports_mte())
return true;
- /* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */
- return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
+ if (!(vm_flags & VM_MTE))
+ return true;
+
+ if (vm_flags & VM_MTE_ALLOWED)
+ return true;
+
+ if (shmem_file(file))
+ return true;
+
+ return false;
}
-#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
+#define arch_validate_flags(file, vm_flags) arch_validate_flags(file, vm_flags)
#endif /* !BUILD_VDSO */
diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
index af9c10c83dc5..d426e1f7c2c1 100644
--- a/arch/sparc/include/asm/mman.h
+++ b/arch/sparc/include/asm/mman.h
@@ -10,6 +10,7 @@ int sparc_mmap_check(unsigned long addr, unsigned long len);
#ifdef CONFIG_SPARC64
#include <asm/adi_64.h>
+#include <linux/fs.h>
static inline void ipi_set_tstate_mcde(void *arg)
{
@@ -54,11 +55,11 @@ static inline int sparc_validate_prot(unsigned long prot, unsigned long addr)
return 1;
}
-#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
+#define arch_validate_flags(file, vm_flags) arch_validate_flags(file, vm_flags)
/* arch_validate_flags() - Ensure combination of flags is valid for a
* VMA.
*/
-static inline bool arch_validate_flags(unsigned long vm_flags)
+static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags)
{
/* If ADI is being enabled on this VMA, check for ADI
* capability on the platform and ensure VMA is suitable
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 8ddca62d6460..82e6488026b7 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -117,7 +117,7 @@ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
*
* Returns true if the VM_* flags are valid.
*/
-static inline bool arch_validate_flags(unsigned long flags)
+static inline bool arch_validate_flags(struct file *file, unsigned long flags)
{
return true;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 8462de1ee583..e06171d243ef 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1504,7 +1504,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
#ifdef CONFIG_SPARC64
/* TODO: Fix SPARC ADI! */
- WARN_ON_ONCE(!arch_validate_flags(vm_flags));
+ WARN_ON_ONCE(!arch_validate_flags(file, vm_flags));
#endif
/* Lock the VMA since it is modified after insertion into VMA tree */
@@ -1587,7 +1587,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return -EACCES;
/* Allow architectures to sanity-check the vm_flags. */
- if (!arch_validate_flags(vm_flags))
+ if (!arch_validate_flags(file, vm_flags))
return -EINVAL;
/* Map writable and ensure this isn't a sealed memfd. */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f450af3252e..c6db98b893fc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -816,7 +816,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
}
/* Allow architectures to sanity-check the new flags */
- if (!arch_validate_flags(newflags)) {
+ if (!arch_validate_flags(vma->vm_file, newflags)) {
error = -EINVAL;
break;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 4ba1d00fabda..e87f5d6799a7 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
if (ret)
return ret;
- /* arm64 - allow memory tagging on RAM-based files */
- vm_flags_set(vma, VM_MTE_ALLOWED);
-
file_accessed(file);
/* This is anonymous shared memory if it is unlinked at the time of mmap */
if (inode->i_nlink)
--
2.47.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 22:14 ` Lorenzo Stoakes
@ 2024-10-29 7:50 ` Vlastimil Babka
2024-10-29 10:23 ` Lorenzo Stoakes
2024-10-29 12:33 ` Mark Brown
2024-10-29 15:04 ` Catalin Marinas
2 siblings, 1 reply; 43+ messages in thread
From: Vlastimil Babka @ 2024-10-29 7:50 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Linus Torvalds, Liam R. Howlett, Mark Brown, Andrew Morton,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On 10/28/24 23:14, Lorenzo Stoakes wrote:
> On Mon, Oct 28, 2024 at 10:28:47PM +0100, Vlastimil Babka wrote:
>> On 10/28/24 22:19, Linus Torvalds wrote:
>> > On Mon, 28 Oct 2024 at 11:00, Vlastimil Babka <vbabka@suse.cz> wrote:
>> >>
>> >> VM_MTE_ALLOWED is also set by arm64's arch_calc_vm_flag_bits():
>> >>
>> >> if (system_supports_mte() && (flags & MAP_ANONYMOUS))
>> >> return VM_MTE_ALLOWED;
>> >
>> > Yeah, but that should just move into arch_validate_flags() too.
>> > There's no reason why that's done in a separate place.
>> >
>> > Linus
>>
>> Right, and VM_DATA_DEFAULT_FLAGS is only used in do_brk_flags() which is
>> also an anonymous VMA, and it doesn't perform arch_validate_flags() anyway.
>
> Unfortunately we can't do this and have everything work entirely
> consistently.
Consistently with the current implementation, which seems inconsistent wrt
hugetlbfs.
> This is because, while adding a file parameter to arch_validate_flags()
> lets us do this shmem check, we can't be sure MAP_ANON was set and we do
> not have access to this information at the point of the
> arch_validate_flags() check.
>
> We could check file == NULL, but this then excludes the MAP_ANON |
> MAP_HUGETLB case which is (probably accidentally) permitted by
> arch_calc_vm_flag_bits() and for the purposes of this fix we probably just
> want to keep behaviour identical.
>
> We could alternatively check the file is shmem _or_ hugetlb but this would
> amount to a change in behaviour and not sure we want to go there.
>
> Anyway I attach a patch that does what you suggest, I actually compiled
> this one (!) but don't have a system set up to test it (Mark?)
>
> If we bring this in it should probably be a separate commit...
>
> ----8<----
> From 247003cd2a4b5f4fc2dac97f5ef7e473a47f4324 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Mon, 28 Oct 2024 22:05:44 +0000
> Subject: [PATCH] mm: perform MTE check within arm64 hook entirely
>
> It doesn't make sense to have shmem explicitly check this one arch-specific
> case; it is arch-specific, so the arch should handle it. We know shmem is a
> case in which we want to permit MTE, so simply check for this directly.
>
> This also fixes the issue with checking arch_validate_flags() early, which
> would otherwise break mmap_region().
>
> In order to implement this we must pass a file pointer, and additionally
> update the sparc code to accept this parameter too.
>
> We'd ideally like to have eliminated the arch_calc_vm_flag_bits() case, but
> we risk inadvertently changing behaviour as we do not have mmap() flags
> available at the point of the arch_validate_flags() check and a MAP_ANON |
> MAP_HUGETLB case would be accepted for MTE currently, but a MAP_SHARED |
> MAP_HUGETLB would not.
>
> This is likely an oversight but we want to try to keep behaviour identical
> to before in this patch.
If it's confirmed to be an oversight and MAP_SHARED hugetlbfs should be
allowed, we could add another is_file_hugepages() condition.
> So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> MAP_ANON.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> arch/arm64/include/asm/mman.h | 18 ++++++++++++++----
> arch/sparc/include/asm/mman.h | 5 +++--
> include/linux/mman.h | 2 +-
> mm/mmap.c | 4 ++--
> mm/mprotect.c | 2 +-
> mm/shmem.c | 3 ---
> 6 files changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> index 9e39217b4afb..44b7c8a1dd67 100644
> --- a/arch/arm64/include/asm/mman.h
> +++ b/arch/arm64/include/asm/mman.h
> @@ -6,7 +6,9 @@
>
> #ifndef BUILD_VDSO
> #include <linux/compiler.h>
> +#include <linux/fs.h>
> #include <linux/types.h>
> +#include <linux/shmem_fs.h>
>
> static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
> unsigned long pkey)
> @@ -60,15 +62,23 @@ static inline bool arch_validate_prot(unsigned long prot,
> }
> #define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr)
>
> -static inline bool arch_validate_flags(unsigned long vm_flags)
> +static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags)
> {
> if (!system_supports_mte())
> return true;
>
> - /* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */
> - return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
> + if (!(vm_flags & VM_MTE))
> + return true;
> +
> + if (vm_flags & VM_MTE_ALLOWED)
> + return true;
> +
> + if (shmem_file(file))
> + return true;
> +
> + return false;
> }
> -#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
> +#define arch_validate_flags(file, vm_flags) arch_validate_flags(file, vm_flags)
>
> #endif /* !BUILD_VDSO */
>
> diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
> index af9c10c83dc5..d426e1f7c2c1 100644
> --- a/arch/sparc/include/asm/mman.h
> +++ b/arch/sparc/include/asm/mman.h
> @@ -10,6 +10,7 @@ int sparc_mmap_check(unsigned long addr, unsigned long len);
>
> #ifdef CONFIG_SPARC64
> #include <asm/adi_64.h>
> +#include <linux/fs.h>
>
> static inline void ipi_set_tstate_mcde(void *arg)
> {
> @@ -54,11 +55,11 @@ static inline int sparc_validate_prot(unsigned long prot, unsigned long addr)
> return 1;
> }
>
> -#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
> +#define arch_validate_flags(file, vm_flags) arch_validate_flags(file, vm_flags)
> /* arch_validate_flags() - Ensure combination of flags is valid for a
> * VMA.
> */
> -static inline bool arch_validate_flags(unsigned long vm_flags)
> +static inline bool arch_validate_flags(struct file *file, unsigned long vm_flags)
> {
> /* If ADI is being enabled on this VMA, check for ADI
> * capability on the platform and ensure VMA is suitable
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 8ddca62d6460..82e6488026b7 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -117,7 +117,7 @@ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
> *
> * Returns true if the VM_* flags are valid.
> */
> -static inline bool arch_validate_flags(unsigned long flags)
> +static inline bool arch_validate_flags(struct file *file, unsigned long flags)
> {
> return true;
> }
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8462de1ee583..e06171d243ef 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1504,7 +1504,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
>
> #ifdef CONFIG_SPARC64
> /* TODO: Fix SPARC ADI! */
> - WARN_ON_ONCE(!arch_validate_flags(vm_flags));
> + WARN_ON_ONCE(!arch_validate_flags(file, vm_flags));
> #endif
>
> /* Lock the VMA since it is modified after insertion into VMA tree */
> @@ -1587,7 +1587,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> return -EACCES;
>
> /* Allow architectures to sanity-check the vm_flags. */
> - if (!arch_validate_flags(vm_flags))
> + if (!arch_validate_flags(file, vm_flags))
> return -EINVAL;
>
> /* Map writable and ensure this isn't a sealed memfd. */
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 6f450af3252e..c6db98b893fc 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -816,7 +816,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
> }
>
> /* Allow architectures to sanity-check the new flags */
> - if (!arch_validate_flags(newflags)) {
> + if (!arch_validate_flags(vma->vm_file, newflags)) {
> error = -EINVAL;
> break;
> }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4ba1d00fabda..e87f5d6799a7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> if (ret)
> return ret;
>
> - /* arm64 - allow memory tagging on RAM-based files */
> - vm_flags_set(vma, VM_MTE_ALLOWED);
> -
> file_accessed(file);
> /* This is anonymous shared memory if it is unlinked at the time of mmap */
> if (inode->i_nlink)
> --
> 2.47.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-29 7:50 ` Vlastimil Babka
@ 2024-10-29 10:23 ` Lorenzo Stoakes
0 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-29 10:23 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Linus Torvalds, Liam R. Howlett, Mark Brown, Andrew Morton,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Tue, Oct 29, 2024 at 08:50:11AM +0100, Vlastimil Babka wrote:
> On 10/28/24 23:14, Lorenzo Stoakes wrote:
<snip>
> > Unfortunately we can't do this and have everything work entirely
> > consistently.
>
> Consistently with the current implementation, which seems inconsistent wrt
> hugetlbfs.
Yeah 100% I think this is an oversight.
<snip>
> > This is likely an oversight but we want to try to keep behaviour identical
> > to before in this patch.
>
> If it's confirmed to be oversight and MAP_SHARED hugetlbfs should be
> allowed, we could add another is_file_hugepages() condition
Yeah, I mean I think this _is_ an oversight, but I think it's best to delay a
fix for that to a follow-up series and just do this for the hotfix. Might
throw in a comment.
Andrew requested a comment on the shmem_file() check anyway, so I can sling
this in there!
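Roughly something like this, untested and purely illustrative of the possible
follow-up (assuming is_file_hugepages() is the right predicate to use), in the
arm64 arch_validate_flags() hook:
        if (shmem_file(file) || is_file_hugepages(file))
                return true;
But that's one for the follow-up series, not the hotfix.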
>
> > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> > MAP_ANON.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
Thanks!
<snip>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 22:14 ` Lorenzo Stoakes
2024-10-29 7:50 ` Vlastimil Babka
@ 2024-10-29 12:33 ` Mark Brown
2024-10-29 12:41 ` Lorenzo Stoakes
2024-10-29 15:04 ` Catalin Marinas
2 siblings, 1 reply; 43+ messages in thread
From: Mark Brown @ 2024-10-29 12:33 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Andrew Morton,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> It doesn't make sense to have shmem explicitly check this one arch-specific
> case, it is arch-specific, so the arch should handle it. We know shmem is a
> case in which we want to permit MTE, so simply check for this directly.
>
> This also fixes the issue with checking arch_validate_flags() early, which
> would otherwise break mmap_region().
The relevant tests all pass with this on today's pending-fixes.
Tested-by: Mark Brown <broonie@kernel.org>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-29 12:33 ` Mark Brown
@ 2024-10-29 12:41 ` Lorenzo Stoakes
0 siblings, 0 replies; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-29 12:41 UTC (permalink / raw)
To: Mark Brown
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Andrew Morton,
Jann Horn, linux-kernel, linux-mm, Peter Xu, linux-arm-kernel,
Catalin Marinas, Will Deacon, Aishwarya TCV
On Tue, Oct 29, 2024 at 12:33:13PM +0000, Mark Brown wrote:
> On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
>
> > It doesn't make sense to have shmem explicitly check this one arch-specific
> > case, it is arch-specific, so the arch should handle it. We know shmem is a
> > case in which we want to permit MTE, so simply check for this directly.
> >
> > This also fixes the issue with checking arch_validate_flags() early, which
> > would otherwise break mmap_region().
>
> The relevant tests all pass with this on today's pending-fixes.
Thanks, am respinning series now, will add your tag!
>
> Tested-by: Mark Brown <broonie@kernel.org>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-28 22:14 ` Lorenzo Stoakes
2024-10-29 7:50 ` Vlastimil Babka
2024-10-29 12:33 ` Mark Brown
@ 2024-10-29 15:04 ` Catalin Marinas
2024-10-29 15:16 ` Lorenzo Stoakes
2 siblings, 1 reply; 43+ messages in thread
From: Catalin Marinas @ 2024-10-29 15:04 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Mark Brown,
Andrew Morton, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Will Deacon, Aishwarya TCV
Hi Lorenzo,
Thanks for trying to fix this.
On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> From 247003cd2a4b5f4fc2dac97f5ef7e473a47f4324 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Mon, 28 Oct 2024 22:05:44 +0000
> Subject: [PATCH] mm: perform MTE check within arm64 hook entirely
>
> It doesn't make sense to have shmem explicitly check this one arch-specific
> case, it is arch-specific, so the arch should handle it. We know shmem is a
> case in which we want to permit MTE, so simply check for this directly.
>
> This also fixes the issue with checking arch_validate_flags() early, which
> would otherwise break mmap_region().
>
> In order to implement this we must pass a file pointer, and additionally
> update the sparc code to accept this parameter too.
>
> We'd ideally like to have eliminated the arch_calc_vm_flag_bits() case, but
> we risk inadvertently changing behaviour as we do not have mmap() flags
> available at the point of the arch_validate_flags() check and a MAP_ANON |
> MAP_HUGETLB case would be accepted for MTE currently, but a MAP_SHARED |
> MAP_HUGETLB would not.
>
> This is likely an oversight but we want to try to keep behaviour identical
> to before in this patch.
MAP_HUGETLB support for MTE is only in -next currently, so there
wouldn't be any ABI change if we also allowed MAP_SHARED | MAP_HUGETLB.
In 6.12, MAP_HUGETLB is not allowed to have PROT_MTE.
> So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> MAP_ANON.
[...]
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 4ba1d00fabda..e87f5d6799a7 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> if (ret)
> return ret;
>
> - /* arm64 - allow memory tagging on RAM-based files */
> - vm_flags_set(vma, VM_MTE_ALLOWED);
This breaks arm64 KVM if the VMM uses shared mappings for the memory
slots (which is possible). We have kvm_vma_mte_allowed() that checks for
the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly.
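For reference, that check is essentially just (from memory, see
arch/arm64/kvm/mmu.c for the exact form):
        static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
        {
                return vma->vm_flags & VM_MTE_ALLOWED;
        }
so if shmem stops setting VM_MTE_ALLOWED, KVM will refuse MTE for memslots
backed by such shared mappings.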
I need to read this thread properly but why not pass the file argument
to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
--
Catalin
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-29 15:04 ` Catalin Marinas
@ 2024-10-29 15:16 ` Lorenzo Stoakes
2024-10-29 16:22 ` Catalin Marinas
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-29 15:16 UTC (permalink / raw)
To: Catalin Marinas
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Mark Brown,
Andrew Morton, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Will Deacon, Aishwarya TCV
On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote:
> Hi Lorenzo,
>
> Thanks for trying to fix this.
>
> On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> > From 247003cd2a4b5f4fc2dac97f5ef7e473a47f4324 Mon Sep 17 00:00:00 2001
> > From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Date: Mon, 28 Oct 2024 22:05:44 +0000
> > Subject: [PATCH] mm: perform MTE check within arm64 hook entirely
> >
> > It doesn't make sense to have shmem explicitly check this one arch-specific
> > case, it is arch-specific, so the arch should handle it. We know shmem is a
> > case in which we want to permit MTE, so simply check for this directly.
> >
> > This also fixes the issue with checking arch_validate_flags() early, which
> > would otherwise break mmap_region().
> >
> > In order to implement this we must pass a file pointer, and additionally
> > update the sparc code to accept this parameter too.
> >
> > We'd ideally like to have eliminated the arch_calc_vm_flag_bits() case, but
> > we risk inadvertently changing behaviour as we do not have mmap() flags
> > available at the point of the arch_validate_flags() check and a MAP_ANON |
> > MAP_HUGETLB case would be accepted for MTE currently, but a MAP_SHARED |
> > MAP_HUGETLB would not.
> >
> > This is likely an oversight but we want to try to keep behaviour identical
> > to before in this patch.
>
> MAP_HUGETLB support for MTE is only in -next currently, so there
> wouldn't be any ABI change if we also allowed MAP_SHARED | MAP_HUGETLB.
> In 6.12, MAP_HUGETLB is not allowed to have PROT_MTE.
>
> > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> > MAP_ANON.
> [...]
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 4ba1d00fabda..e87f5d6799a7 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> > if (ret)
> > return ret;
> >
> > - /* arm64 - allow memory tagging on RAM-based files */
> > - vm_flags_set(vma, VM_MTE_ALLOWED);
>
> This breaks arm64 KVM if the VMM uses shared mappings for the memory
> slots (which is possible). We have kvm_vma_mte_allowed() that checks for
> the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly.
Ugh yup missed that thanks.
>
> I need to read this thread properly but why not pass the file argument
> to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
Can't really do that as it is entangled in a bunch of other stuff,
e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch
of places including arch code and... etc. etc.
And definitely no good for a hotfix that has to be backported.
I suggest that instead we don't drop the yucky shmem thing, which will
set VM_MTE_ALLOWED for shmem, with arch_calc_vm_flag_bits() still setting
it for MAP_ANON, but the other changes will mean the arch_validate_flags()
will be fixed too.
So this just means not dropping the mm/shmem.c bit basically and everything
should 'just work'?
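That is, just keep these two existing lines in shmem_mmap() exactly as they
are today:
        /* arm64 - allow memory tagging on RAM-based files */
        vm_flags_set(vma, VM_MTE_ALLOWED);
and leave the rest of the hotfix as posted.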
>
> --
> Catalin
But we definitely need to find a better way post-hotfix to sort all this
out, I'm sure you agree :)
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-29 15:16 ` Lorenzo Stoakes
@ 2024-10-29 16:22 ` Catalin Marinas
2024-10-29 16:36 ` Lorenzo Stoakes
0 siblings, 1 reply; 43+ messages in thread
From: Catalin Marinas @ 2024-10-29 16:22 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Mark Brown,
Andrew Morton, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Will Deacon, Aishwarya TCV
On Tue, Oct 29, 2024 at 03:16:00PM +0000, Lorenzo Stoakes wrote:
> On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote:
> > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> > > MAP_ANON.
> > [...]
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 4ba1d00fabda..e87f5d6799a7 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> > > if (ret)
> > > return ret;
> > >
> > > - /* arm64 - allow memory tagging on RAM-based files */
> > > - vm_flags_set(vma, VM_MTE_ALLOWED);
> >
> > This breaks arm64 KVM if the VMM uses shared mappings for the memory
> > slots (which is possible). We have kvm_vma_mte_allowed() that checks for
> > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly.
>
> Ugh yup missed that thanks.
>
> > I need to read this thread properly but why not pass the file argument
> > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
>
> Can't really do that as it is entangled in a bunch of other stuff,
> e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch
> of places including arch code and... etc. etc.
Not calc_vm_prot_bits() but calc_vm_flag_bits().
arch_calc_vm_flag_bits() is only implemented by two architectures -
arm64 and parisc and calc_vm_flag_bits() is only called from do_mmap().
Basically we want to set VM_MTE_ALLOWED early during the mmap() call
and, at the time, my thinking was to do it in calc_vm_flag_bits(). The
calc_vm_prot_bits() OTOH is also called on the mprotect() path and is
responsible for translating PROT_MTE into a VM_MTE flag without any
checks. arch_validate_flags() would check if VM_MTE comes together with
VM_MTE_ALLOWED. But, as in the KVM case, that's not the only function
checking VM_MTE_ALLOWED.
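For context, the prot side has no file available at all; on arm64 it is a
straight translation (simplified sketch, the MTE part of
arch_calc_vm_prot_bits() only):
        if (system_supports_mte() && (prot & PROT_MTE))
                return VM_MTE;
so the file-based decision has to happen on the flags side.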
Since calc_vm_flag_bits() did not take a file argument, the lazy
approach was to add the flag explicitly for shmem (and hugetlbfs in
-next). But I think it would be easier to just add the file argument to
calc_vm_flag_bits() and do the check in the arch code to return
VM_MTE_ALLOWED. AFAICT, this is called before mmap_region() and
arch_validate_flags() (unless I missed something in the recent
reworking).
> I suggest that instead we don't drop the yucky shmem thing, which will
> set VM_MTE_ALLOWED for shmem, with arch_calc_vm_flag_bits() still setting
> it for MAP_ANON, but the other changes will mean the arch_validate_flags()
> will be fixed too.
>
> So this just means not dropping the mm/shmem.c bit basically and everything
> should 'just work'?
If we can't get the calc_vm_flag_bits() approach to work, I'm fine with
this as a fix and we'll look to do it properly from 6.13.
--
Catalin
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-29 16:22 ` Catalin Marinas
@ 2024-10-29 16:36 ` Lorenzo Stoakes
2024-10-29 17:02 ` Catalin Marinas
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-29 16:36 UTC (permalink / raw)
To: Catalin Marinas
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Mark Brown,
Andrew Morton, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Will Deacon, Aishwarya TCV
On Tue, Oct 29, 2024 at 04:22:42PM +0000, Catalin Marinas wrote:
> On Tue, Oct 29, 2024 at 03:16:00PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote:
> > > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> > > > MAP_ANON.
> > > [...]
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index 4ba1d00fabda..e87f5d6799a7 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> > > > if (ret)
> > > > return ret;
> > > >
> > > > - /* arm64 - allow memory tagging on RAM-based files */
> > > > - vm_flags_set(vma, VM_MTE_ALLOWED);
> > >
> > > This breaks arm64 KVM if the VMM uses shared mappings for the memory
> > > slots (which is possible). We have kvm_vma_mte_allowed() that checks for
> > > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly.
> >
> > Ugh yup missed that thanks.
> >
> > > I need to read this thread properly but why not pass the file argument
> > > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
> >
> > Can't really do that as it is entangled in a bunch of other stuff,
> > e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch
> > of places including arch code and... etc. etc.
>
> Not calc_vm_prot_bits() but calc_vm_flag_bits().
> arch_calc_vm_flag_bits() is only implemented by two architectures -
> arm64 and parisc and calc_vm_flag_bits() is only called from do_mmap().
>
> Basically we want to set VM_MTE_ALLOWED early during the mmap() call
> and, at the time, my thinking was to do it in calc_vm_flag_bits(). The
> calc_vm_prot_bits() OTOH is also called on the mprotect() path and is
> responsible for translating PROT_MTE into a VM_MTE flag without any
> checks. arch_validate_flags() would check if VM_MTE comes together with
> VM_MTE_ALLOWED. But, as in the KVM case, that's not the only function
> checking VM_MTE_ALLOWED.
>
> Since calc_vm_flag_bits() did not take a file argument, the lazy
> approach was to add the flag explicitly for shmem (and hugetlbfs in
> -next). But I think it would be easier to just add the file argument to
> calc_vm_flag_bits() and do the check in the arch code to return
> VM_MTE_ALLOWED. AFAICT, this is called before mmap_region() and
> arch_validate_flags() (unless I missed something in the recent
> reworking).
I mean I totally get why you're suggesting it - it's the right _place_ but...
It would require changes to a ton of code which is no good for a backport
and we don't _need_ to do it.
I'd rather do the smallest delta at this point, as I am not a huge fan of
sticking it in here (I mean your point is wholly valid - it's at a better
place to do so and we can change flags here, it's just - it's not where you
expect to do this obviously).
I mean, for instance, in arch/x86/kernel/cpu/sgx/encl.c (a file I'd _really_
like us not to touch here, by the way) we'd have to, what, pass NULL?
I mean, passing file to arch_validate_flags() is icky, but it makes some
sense since we _always_ have that available and meaningful at the point of
invocation, whereas if we added it to arch_calc_vm_flag_bits() there are now
places where it's not available.
And then we're assuming we can just pass NULL... and it becomes a confusing
mess really I think.
I also worry we might somehow break something somewhere this way; we're
already exposed to subtle issues here.
Alternatively, we can change my series by 2 lines (as I've already asked
Andrew to do): everything still works, the fix applies, the VM_MTE_ALLOWED
flag still works in an obvious way (it behaves exactly as it did before),
and all is well with the world and we can frolic in the fields freely and
joyously :)
>
> > I suggest that instead we don't drop the yucky shmem thing, which will
> > set VM_MTE_ALLOWED for shmem, with arch_calc_vm_flag_bits() still setting
> > it for MAP_ANON, but the other changes will mean the arch_validate_flags()
> > will be fixed too.
> >
> > So this just means not dropping the mm/shmem.c bit basically and everything
> > should 'just work'?
>
> If we can't get the calc_vm_flag_bits() approach to work, I'm fine with
> this as a fix and we'll look to do it properly from 6.13.
Overwhelmingly, I think that since I'm going to be backporting this as a
hotfix, it's better to just leave the shmem stuff in and leave the rest the
same.
I really would like us to figure out a better way overall from >=6.13
though and replace all this with something saner :>)
Am happy to help and collaborate on that!
>
> --
> Catalin
Cheers, and sorry to fiddle with arm64 stuff here; it sadly just happens to
be where this issue becomes a thing with this hotfix!
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-29 16:36 ` Lorenzo Stoakes
@ 2024-10-29 17:02 ` Catalin Marinas
2024-10-29 17:28 ` Lorenzo Stoakes
0 siblings, 1 reply; 43+ messages in thread
From: Catalin Marinas @ 2024-10-29 17:02 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Mark Brown,
Andrew Morton, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Will Deacon, Aishwarya TCV
On Tue, Oct 29, 2024 at 04:36:32PM +0000, Lorenzo Stoakes wrote:
> On Tue, Oct 29, 2024 at 04:22:42PM +0000, Catalin Marinas wrote:
> > On Tue, Oct 29, 2024 at 03:16:00PM +0000, Lorenzo Stoakes wrote:
> > > On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote:
> > > > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> > > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> > > > > MAP_ANON.
> > > > [...]
> > > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > > index 4ba1d00fabda..e87f5d6799a7 100644
> > > > > --- a/mm/shmem.c
> > > > > +++ b/mm/shmem.c
> > > > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > if (ret)
> > > > > return ret;
> > > > >
> > > > > - /* arm64 - allow memory tagging on RAM-based files */
> > > > > - vm_flags_set(vma, VM_MTE_ALLOWED);
> > > >
> > > > This breaks arm64 KVM if the VMM uses shared mappings for the memory
> > > > slots (which is possible). We have kvm_vma_mte_allowed() that checks for
> > > > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly.
> > >
> > > Ugh yup missed that thanks.
> > >
> > > > I need to read this thread properly but why not pass the file argument
> > > > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
> > >
> > > Can't really do that as it is entangled in a bunch of other stuff,
> > > e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch
> > > of places including arch code and... etc. etc.
> >
> > Not calc_vm_prot_bits() but calc_vm_flag_bits().
> > arch_calc_vm_flag_bits() is only implemented by two architectures -
> > arm64 and parisc and calc_vm_flag_bits() is only called from do_mmap().
> >
> > Basically we want to set VM_MTE_ALLOWED early during the mmap() call
> > and, at the time, my thinking was to do it in calc_vm_flag_bits(). The
> > calc_vm_prot_bits() OTOH is also called on the mprotect() path and is
> > responsible for translating PROT_MTE into a VM_MTE flag without any
> > checks. arch_validate_flags() would check if VM_MTE comes together with
> > VM_MTE_ALLOWED. But, as in the KVM case, that's not the only function
> > checking VM_MTE_ALLOWED.
> >
> > Since calc_vm_flag_bits() did not take a file argument, the lazy
> > approach was to add the flag explicitly for shmem (and hugetlbfs in
> > -next). But I think it would be easier to just add the file argument to
> > calc_vm_flag_bits() and do the check in the arch code to return
> > VM_MTE_ALLOWED. AFAICT, this is called before mmap_region() and
> > arch_validate_flags() (unless I missed something in the recent
> > reworking).
>
> I mean I totally get why you're suggesting it
Not sure ;)
> - it's the right _place_ but...
> It would require changes to a ton of code which is no good for a backport
> and we don't _need_ to do it.
>
> I'd rather do the smallest delta at this point, as I am not a huge fan of
> sticking it in here (I mean your point is wholly valid - it's at a better
> place to do so and we can change flags here, it's just - it's not where you
> expect to do this obviously).
>
> I mean, for instance, in arch/x86/kernel/cpu/sgx/encl.c (a file I'd _really_
> like us not to touch here, by the way) we'd have to, what, pass NULL?
That's calc_vm_prot_bits(). I suggested calc_vm_flag_bits() (I know,
confusing names and total lack of inspiration when we added MTE
support). The latter is only called in one place - do_mmap().
That's what I meant (untested, on top of -next as it has a MAP_HUGETLB
check in there). I don't think it's much worse than your proposal,
assuming that it works:
diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index 1dbfb56cb313..358bbaaafd41 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -6,6 +6,8 @@
#ifndef BUILD_VDSO
#include <linux/compiler.h>
+#include <linux/fs.h>
+#include <linux/shmem_fs.h>
#include <linux/types.h>
static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
@@ -31,7 +33,7 @@ static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
}
#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
-static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
+static inline unsigned long arch_calc_vm_flag_bits(struct file *file, unsigned long flags)
{
/*
* Only allow MTE on anonymous mappings as these are guaranteed to be
@@ -39,12 +41,12 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
* filesystem supporting MTE (RAM-based).
*/
if (system_supports_mte() &&
- (flags & (MAP_ANONYMOUS | MAP_HUGETLB)))
+ (flags & (MAP_ANONYMOUS | MAP_HUGETLB) || shmem_file(file)))
return VM_MTE_ALLOWED;
return 0;
}
-#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
+#define arch_calc_vm_flag_bits(file, flags) arch_calc_vm_flag_bits(file, flags)
static inline bool arch_validate_prot(unsigned long prot,
unsigned long addr __always_unused)
diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
index 89b6beeda0b8..663f587dc789 100644
--- a/arch/parisc/include/asm/mman.h
+++ b/arch/parisc/include/asm/mman.h
@@ -2,6 +2,7 @@
#ifndef __ASM_MMAN_H__
#define __ASM_MMAN_H__
+#include <linux/fs.h>
#include <uapi/asm/mman.h>
/* PARISC cannot allow mdwe as it needs writable stacks */
@@ -11,7 +12,7 @@ static inline bool arch_memory_deny_write_exec_supported(void)
}
#define arch_memory_deny_write_exec_supported arch_memory_deny_write_exec_supported
-static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
+static inline unsigned long arch_calc_vm_flag_bits(struct file *file, unsigned long flags)
{
/*
* The stack on parisc grows upwards, so if userspace requests memory
@@ -23,6 +24,6 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
return 0;
}
-#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
+#define arch_calc_vm_flag_bits(file, flags) arch_calc_vm_flag_bits(file, flags)
#endif /* __ASM_MMAN_H__ */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 8ddca62d6460..a842783ffa62 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_MMAN_H
#define _LINUX_MMAN_H
+#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/percpu_counter.h>
@@ -94,7 +95,7 @@ static inline void vm_unacct_memory(long pages)
#endif
#ifndef arch_calc_vm_flag_bits
-#define arch_calc_vm_flag_bits(flags) 0
+#define arch_calc_vm_flag_bits(file, flags) 0
#endif
#ifndef arch_validate_prot
@@ -151,13 +152,13 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
* Combine the mmap "flags" argument into "vm_flags" used internally.
*/
static inline unsigned long
-calc_vm_flag_bits(unsigned long flags)
+calc_vm_flag_bits(struct file *file, unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
_calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
_calc_vm_trans(flags, MAP_STACK, VM_NOHUGEPAGE) |
- arch_calc_vm_flag_bits(flags);
+ arch_calc_vm_flag_bits(file, flags);
}
unsigned long vm_commit_limit(void);
diff --git a/mm/mmap.c b/mm/mmap.c
index f102314bb500..f904b3bba962 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -344,7 +344,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
* to. we assume access permissions have been handled by the open
* of the memory object, so we don't do any here.
*/
- vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+ vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(file, flags) |
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
/* Obtain the address to map to. we verify (or select) it and ensure
diff --git a/mm/nommu.c b/mm/nommu.c
index 635d028d647b..e9b5f527ab5b 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -842,7 +842,7 @@ static unsigned long determine_vm_flags(struct file *file,
{
unsigned long vm_flags;
- vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(flags);
+ vm_flags = calc_vm_prot_bits(prot, 0) | calc_vm_flag_bits(file, flags);
if (!file) {
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index f24a0f34723e..ff194341fddb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2737,9 +2737,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
if (ret)
return ret;
- /* arm64 - allow memory tagging on RAM-based files */
- vm_flags_set(vma, VM_MTE_ALLOWED);
-
file_accessed(file);
/* This is anonymous shared memory if it is unlinked at the time of mmap */
if (inode->i_nlink)
--
Catalin
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-29 17:02 ` Catalin Marinas
@ 2024-10-29 17:28 ` Lorenzo Stoakes
2024-10-29 17:32 ` Catalin Marinas
0 siblings, 1 reply; 43+ messages in thread
From: Lorenzo Stoakes @ 2024-10-29 17:28 UTC (permalink / raw)
To: Catalin Marinas
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Mark Brown,
Andrew Morton, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Will Deacon, Aishwarya TCV
On Tue, Oct 29, 2024 at 05:02:23PM +0000, Catalin Marinas wrote:
> On Tue, Oct 29, 2024 at 04:36:32PM +0000, Lorenzo Stoakes wrote:
> > On Tue, Oct 29, 2024 at 04:22:42PM +0000, Catalin Marinas wrote:
> > > On Tue, Oct 29, 2024 at 03:16:00PM +0000, Lorenzo Stoakes wrote:
> > > > On Tue, Oct 29, 2024 at 03:04:41PM +0000, Catalin Marinas wrote:
> > > > > On Mon, Oct 28, 2024 at 10:14:50PM +0000, Lorenzo Stoakes wrote:
> > > > > > So continue to check VM_MTE_ALLOWED which arch_calc_vm_flag_bits() sets if
> > > > > > MAP_ANON.
> > > > > [...]
> > > > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > > > index 4ba1d00fabda..e87f5d6799a7 100644
> > > > > > --- a/mm/shmem.c
> > > > > > +++ b/mm/shmem.c
> > > > > > @@ -2733,9 +2733,6 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> > > > > > if (ret)
> > > > > > return ret;
> > > > > >
> > > > > > - /* arm64 - allow memory tagging on RAM-based files */
> > > > > > - vm_flags_set(vma, VM_MTE_ALLOWED);
> > > > >
> > > > > This breaks arm64 KVM if the VMM uses shared mappings for the memory
> > > > > slots (which is possible). We have kvm_vma_mte_allowed() that checks for
> > > > > the VM_MTE_ALLOWED flag as the VMM may not use PROT_MTE/VM_MTE directly.
> > > >
> > > > Ugh yup missed that thanks.
> > > >
> > > > > I need to read this thread properly but why not pass the file argument
> > > > > to arch_calc_vm_flag_bits() and set VM_MTE_ALLOWED in there?
> > > >
> > > > Can't really do that as it is entangled in a bunch of other stuff,
> > > > e.g. calc_vm_prot_bits() would have to pass file and that's used in a bunch
> > > > of places including arch code and... etc. etc.
> > >
> > > Not calc_vm_prot_bits() but calc_vm_flag_bits().
> > > arch_calc_vm_flag_bits() is only implemented by two architectures -
> > > arm64 and parisc and calc_vm_flag_bits() is only called from do_mmap().
> > >
> > > Basically we want to set VM_MTE_ALLOWED early during the mmap() call
> > > and, at the time, my thinking was to do it in calc_vm_flag_bits(). The
> > > calc_vm_prot_bits() OTOH is also called on the mprotect() path and is
> > > responsible for translating PROT_MTE into a VM_MTE flag without any
> > > checks. arch_validate_flags() would check if VM_MTE comes together with
> > > VM_MTE_ALLOWED. But, as in the KVM case, that's not the only function
> > > checking VM_MTE_ALLOWED.
> > >
> > > Since calc_vm_flag_bits() did not take a file argument, the lazy
> > > approach was to add the flag explicitly for shmem (and hugetlbfs in
> > > -next). But I think it would be easier to just add the file argument to
> > > calc_vm_flag_bits() and do the check in the arch code to return
> > > VM_MTE_ALLOWED. AFAICT, this is called before mmap_region() and
> > > arch_validate_flags() (unless I missed something in the recent
> > > reworking).
> >
> > I mean I totally get why you're suggesting it
>
> Not sure ;)
>
> > - it's the right _place_ but...
> > It would require changes to a ton of code which is no good for a backport
> > and we don't _need_ to do it.
> >
> > I'd rather do the smallest delta at this point, as I am not a huge fan of
> > sticking it in here (I mean your point is wholly valid - it's at a better
> > place to do so and we can change flags here, it's just - it's not where you
> > expect to do this obviously).
> >
> > I mean, for instance, in arch/x86/kernel/cpu/sgx/encl.c (a file I'd _really_
> > like us not to touch here, by the way) we'd have to, what, pass NULL?
>
> That's calc_vm_prot_bits(). I suggested calc_vm_flag_bits() (I know,
> confusing names and total lack of inspiration when we added MTE
> support). The latter is only called in one place - do_mmap().
>
> That's what I meant (untested, on top of -next as it has a MAP_HUGETLB
> check in there). I don't think it's much worse than your proposal,
> assuming that it works:
Right sorry misread. Yeah this is better, let me do a quick -v4 then!
Cheers, sorry pretty tired at this stage, was looking at this all last
night...
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour
2024-10-29 17:28 ` Lorenzo Stoakes
@ 2024-10-29 17:32 ` Catalin Marinas
0 siblings, 0 replies; 43+ messages in thread
From: Catalin Marinas @ 2024-10-29 17:32 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Vlastimil Babka, Linus Torvalds, Liam R. Howlett, Mark Brown,
Andrew Morton, Jann Horn, linux-kernel, linux-mm, Peter Xu,
linux-arm-kernel, Will Deacon, Aishwarya TCV
On Tue, Oct 29, 2024 at 05:28:49PM +0000, Lorenzo Stoakes wrote:
> On Tue, Oct 29, 2024 at 05:02:23PM +0000, Catalin Marinas wrote:
> > That's what I meant (untested, on top of -next as it has a MAP_HUGETLB
> > check in there). I don't think it's much worse than your proposal,
> > assuming that it works:
>
> Right sorry misread. Yeah this is better, let me do a quick -v4 then!
Thanks!
> Cheers, sorry pretty tired at this stage, was looking at this all last
> night...
No worries. The way we handle MTE is pretty convoluted. Happy to
simplify it if we find a better way.
--
Catalin
^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2024-10-29 17:33 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-23 20:38 [PATCH v2 0/8] fix error handling in mmap_region() and refactor Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 1/8] mm: avoid unsafe VMA hook invocation when error arises on mmap hook Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 2/8] mm: unconditionally close VMAs on error Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 3/8] mm: refactor map_deny_write_exec() Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH hotfix 6.12 v2 4/8] mm: resolve faulty mmap_region() error path behaviour Lorenzo Stoakes
2024-10-28 18:29 ` Mark Brown
2024-10-28 18:57 ` Lorenzo Stoakes
2024-10-28 19:05 ` Linus Torvalds
2024-10-28 19:14 ` Lorenzo Stoakes
2024-10-28 19:50 ` Liam R. Howlett
2024-10-28 20:00 ` Liam R. Howlett
2024-10-28 20:17 ` Lorenzo Stoakes
2024-10-28 20:22 ` Linus Torvalds
2024-10-28 20:43 ` Lorenzo Stoakes
2024-10-28 21:04 ` Liam R. Howlett
2024-10-28 21:05 ` Mark Brown
2024-10-28 21:28 ` Lorenzo Stoakes
2024-10-28 21:00 ` Vlastimil Babka
2024-10-28 21:19 ` Linus Torvalds
2024-10-28 21:28 ` Vlastimil Babka
2024-10-28 22:14 ` Lorenzo Stoakes
2024-10-29 7:50 ` Vlastimil Babka
2024-10-29 10:23 ` Lorenzo Stoakes
2024-10-29 12:33 ` Mark Brown
2024-10-29 12:41 ` Lorenzo Stoakes
2024-10-29 15:04 ` Catalin Marinas
2024-10-29 15:16 ` Lorenzo Stoakes
2024-10-29 16:22 ` Catalin Marinas
2024-10-29 16:36 ` Lorenzo Stoakes
2024-10-29 17:02 ` Catalin Marinas
2024-10-29 17:28 ` Lorenzo Stoakes
2024-10-29 17:32 ` Catalin Marinas
2024-10-28 20:51 ` Mark Brown
2024-10-23 20:38 ` [PATCH v2 5/8] tools: testing: add additional vma_internal.h stubs Lorenzo Stoakes
2024-10-23 20:38 ` [PATCH v2 6/8] mm: isolate mmap internal logic to mm/vma.c Lorenzo Stoakes
2024-10-24 17:23 ` Vlastimil Babka
2024-10-23 20:38 ` [PATCH v2 7/8] mm: refactor __mmap_region() Lorenzo Stoakes
2024-10-25 8:35 ` Vlastimil Babka
2024-10-25 10:19 ` Lorenzo Stoakes
2024-10-25 10:23 ` Vlastimil Babka
2024-10-23 20:38 ` [PATCH v2 8/8] mm: defer second attempt at merge on mmap() Lorenzo Stoakes
2024-10-25 9:43 ` Vlastimil Babka
2024-10-25 10:20 ` Lorenzo Stoakes