linux-mm.kvack.org archive mirror
* [PATCH v2 0/2] mm: perform guard region install/remove under VMA lock
@ 2025-11-10 17:22 Lorenzo Stoakes
  2025-11-10 17:22 ` [PATCH v2 1/2] mm: rename walk_page_range_mm() Lorenzo Stoakes
  2025-11-10 17:22 ` [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock Lorenzo Stoakes
  0 siblings, 2 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	linux-mm, linux-kernel

There is no reason why we can't perform guard region operations under the
VMA lock, as long as we take proper precautions to ensure that we do so in a
safe manner.

This is fine, as VMA lock acquisition is always best-effort, so if we are
unable to do so, we can simply fall back to using the mmap read lock.
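
To illustrate the pattern, here is a rough sketch of the best-effort
acquisition (it mirrors the try_vma_read_lock() logic in patch 2/2, but the
madv_behavior->mm field and the exact fallback flow are approximations rather
than the literal code):

static bool try_vma_read_lock(struct madvise_behavior *madv_behavior)
{
	struct mm_struct *mm = madv_behavior->mm;
	struct vm_area_struct *vma;

	/* Best-effort attempt to take the VMA read lock. */
	vma = lock_vma_under_rcu(mm, madv_behavior->range.start);
	if (vma && is_vma_lock_sufficient(vma, madv_behavior)) {
		madv_behavior->vma = vma;
		return true;	/* Operate under the VMA read lock only. */
	}
	if (vma)
		vma_end_read(vma);

	/* Otherwise fall back to the mmap read lock. */
	mmap_read_lock(mm);
	madv_behavior->lock_mode = MADVISE_MMAP_READ_LOCK;
	return false;
}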

Doing so will reduce mmap lock contention for callers performing guard
region operations and help establish a precedent of trying to use the VMA
lock where possible.

As part of this change we perform a trivial rename of the page walk
functions which bypass safety checks (i.e. whether or not
mm_walk_ops->install_pte is specified) so that we can keep the naming
consistent with the mm walk.

This is because we need to expose a VMA-specific walk that still allows us
to install PTE entries.
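
Concretely, after the rename the checked, exported walkers become thin
wrappers around the unsafe variants - e.g. (lifted from patch 1/2, shown here
only to illustrate the naming scheme, with an added comment):

int walk_page_range(struct mm_struct *mm, unsigned long start,
		unsigned long end, const struct mm_walk_ops *ops,
		void *private)
{
	/* Refuse ops which specify install_pte via the exported interface. */
	if (!check_ops_safe(ops))
		return -EINVAL;

	return walk_page_range_mm_unsafe(mm, start, end, ops, private);
}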


v2:
* Rename check_ops_valid() to check_ops_safe() as per David.
* Rename is_vma_lock_valid() to is_vma_lock_sufficient() as per David.

v1:
https://lore.kernel.org/all/cover.1762686301.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (2):
  mm: rename walk_page_range_mm()
  mm/madvise: allow guard page install/remove under VMA lock

 mm/internal.h |   5 ++-
 mm/madvise.c  | 110 ++++++++++++++++++++++++++++++++++++--------------
 mm/pagewalk.c |  37 ++++++++++-------
 3 files changed, 105 insertions(+), 47 deletions(-)

--
2.51.0



* [PATCH v2 1/2] mm: rename walk_page_range_mm()
  2025-11-10 17:22 [PATCH v2 0/2] mm: perform guard region install/remove under VMA lock Lorenzo Stoakes
@ 2025-11-10 17:22 ` Lorenzo Stoakes
  2025-11-10 18:27   ` David Hildenbrand (Red Hat)
                     ` (2 more replies)
  2025-11-10 17:22 ` [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock Lorenzo Stoakes
  1 sibling, 3 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	linux-mm, linux-kernel

Make it clear we're referencing an unsafe variant of this function
explicitly.

This is laying the foundation for exposing more such functions and
maintaining a consistent naming scheme.

As a part of this change, rename check_ops_valid() to check_ops_safe() for
consistency.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/internal.h |  2 +-
 mm/madvise.c  |  4 ++--
 mm/pagewalk.c | 22 +++++++++++-----------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0af87f6c2889..479234b39394 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1652,7 +1652,7 @@ static inline void accept_page(struct page *page)
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
 /* pagewalk.c */
-int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
+int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
 int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
diff --git a/mm/madvise.c b/mm/madvise.c
index de918b107cfc..7b938ff44be2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1171,8 +1171,8 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		unsigned long nr_pages = 0;
 
 		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
-		err = walk_page_range_mm(vma->vm_mm, range->start, range->end,
-					 &guard_install_walk_ops, &nr_pages);
+		err = walk_page_range_mm_unsafe(vma->vm_mm, range->start,
+				range->end, &guard_install_walk_ops, &nr_pages);
 		if (err < 0)
 			return err;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6cace2c8814a..ab29b16abd2c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -452,7 +452,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
  * We usually restrict the ability to install PTEs, but this functionality is
  * available to internal memory management code and provided in mm/internal.h.
  */
-int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
+int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private)
 {
@@ -518,10 +518,10 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
  * This check is performed on all functions which are parameterised by walk
  * operations and exposed in include/linux/pagewalk.h.
  *
- * Internal memory management code can use the walk_page_range_mm() function to
- * be able to use all page walking operations.
+ * Internal memory management code can use *_unsafe() functions to be able to
+ * use all page walking operations.
  */
-static bool check_ops_valid(const struct mm_walk_ops *ops)
+static bool check_ops_safe(const struct mm_walk_ops *ops)
 {
 	/*
 	 * The installation of PTEs is solely under the control of memory
@@ -579,10 +579,10 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private)
 {
-	if (!check_ops_valid(ops))
+	if (!check_ops_safe(ops))
 		return -EINVAL;
 
-	return walk_page_range_mm(mm, start, end, ops, private);
+	return walk_page_range_mm_unsafe(mm, start, end, ops, private);
 }
 
 /**
@@ -639,7 +639,7 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
 
 	if (start >= end)
 		return -EINVAL;
-	if (!check_ops_valid(ops))
+	if (!check_ops_safe(ops))
 		return -EINVAL;
 
 	return walk_pgd_range(start, end, &walk);
@@ -678,7 +678,7 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 						    pgd, private);
 	if (start >= end || !walk.mm)
 		return -EINVAL;
-	if (!check_ops_valid(ops))
+	if (!check_ops_safe(ops))
 		return -EINVAL;
 
 	/*
@@ -709,7 +709,7 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 		return -EINVAL;
 	if (start < vma->vm_start || end > vma->vm_end)
 		return -EINVAL;
-	if (!check_ops_valid(ops))
+	if (!check_ops_safe(ops))
 		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
@@ -729,7 +729,7 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 
 	if (!walk.mm)
 		return -EINVAL;
-	if (!check_ops_valid(ops))
+	if (!check_ops_safe(ops))
 		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
@@ -780,7 +780,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 	unsigned long start_addr, end_addr;
 	int err = 0;
 
-	if (!check_ops_valid(ops))
+	if (!check_ops_safe(ops))
 		return -EINVAL;
 
 	lockdep_assert_held(&mapping->i_mmap_rwsem);
-- 
2.51.0




* [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock
  2025-11-10 17:22 [PATCH v2 0/2] mm: perform guard region install/remove under VMA lock Lorenzo Stoakes
  2025-11-10 17:22 ` [PATCH v2 1/2] mm: rename walk_page_range_mm() Lorenzo Stoakes
@ 2025-11-10 17:22 ` Lorenzo Stoakes
  2025-11-10 18:29   ` David Hildenbrand (Red Hat)
                     ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-11-10 17:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	linux-mm, linux-kernel

We only need to keep the page tables stable, so we can perform this
operation under the VMA lock. PTE installation is stabilised via the PTE
lock.

One caveat is that, if we prepare vma->anon_vma, we must hold the mmap read
lock. We can account for this by adapting the VMA locking logic to
explicitly check for this case and prevent a VMA lock from being acquired
when it applies.

This check is safe: while we might be raced on anon_vma installation, that
would simply make the check conservative. There is no way for us to see an
anon_vma and then for it to be cleared, as doing so requires the mmap/VMA
write lock.

We abstract the VMA lock validity logic to is_vma_lock_sufficient() for
this purpose, and add prepares_anon_vma() to abstract the anon_vma logic.

In order to do this we need a way of installing page tables explicitly for
an identified VMA, so we introduce an unsafe variant of
walk_page_range_vma() - walk_page_range_vma_unsafe() - and use it when the
VMA read lock is taken.

We additionally update the comments in madvise_guard_install() to more
accurately reflect the cases in which the logic may be reattempted,
specifically THP huge pages being present.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/internal.h |   3 ++
 mm/madvise.c  | 110 ++++++++++++++++++++++++++++++++++++--------------
 mm/pagewalk.c |  17 +++++---
 3 files changed, 94 insertions(+), 36 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 479234b39394..3702fe4a4bac 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1655,6 +1655,9 @@ static inline void accept_page(struct page *page)
 int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
+int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		void *private);
 int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
 			  pgd_t *pgd, void *private);
diff --git a/mm/madvise.c b/mm/madvise.c
index 7b938ff44be2..8bf885b00ff8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1120,18 +1120,17 @@ static int guard_install_set_pte(unsigned long addr, unsigned long next,
 	return 0;
 }
 
-static const struct mm_walk_ops guard_install_walk_ops = {
-	.pud_entry		= guard_install_pud_entry,
-	.pmd_entry		= guard_install_pmd_entry,
-	.pte_entry		= guard_install_pte_entry,
-	.install_pte		= guard_install_set_pte,
-	.walk_lock		= PGWALK_RDLOCK,
-};
-
 static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 {
 	struct vm_area_struct *vma = madv_behavior->vma;
 	struct madvise_behavior_range *range = &madv_behavior->range;
+	struct mm_walk_ops walk_ops = {
+		.pud_entry	= guard_install_pud_entry,
+		.pmd_entry	= guard_install_pmd_entry,
+		.pte_entry	= guard_install_pte_entry,
+		.install_pte	= guard_install_set_pte,
+		.walk_lock	= get_walk_lock(madv_behavior->lock_mode),
+	};
 	long err;
 	int i;
 
@@ -1148,8 +1147,14 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 	/*
 	 * If anonymous and we are establishing page tables the VMA ought to
 	 * have an anon_vma associated with it.
+	 *
+	 * We will hold an mmap read lock if this is necessary; this is checked
+	 * as part of the VMA lock logic.
 	 */
 	if (vma_is_anonymous(vma)) {
+		VM_WARN_ON_ONCE(!vma->anon_vma &&
+				madv_behavior->lock_mode != MADVISE_MMAP_READ_LOCK);
+
 		err = anon_vma_prepare(vma);
 		if (err)
 			return err;
@@ -1157,12 +1162,14 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 
 	/*
 	 * Optimistically try to install the guard marker pages first. If any
-	 * non-guard pages are encountered, give up and zap the range before
-	 * trying again.
+	 * non-guard pages or THP huge pages are encountered, give up and zap
+	 * the range before trying again.
 	 *
 	 * We try a few times before giving up and releasing back to userland to
-	 * loop around, releasing locks in the process to avoid contention. This
-	 * would only happen if there was a great many racing page faults.
+	 * loop around, releasing locks in the process to avoid contention.
+	 *
+	 * This would only happen due to races with e.g. page faults or
+	 * khugepaged.
 	 *
 	 * In most cases we should simply install the guard markers immediately
 	 * with no zap or looping.
@@ -1171,8 +1178,13 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 		unsigned long nr_pages = 0;
 
 		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
-		err = walk_page_range_mm_unsafe(vma->vm_mm, range->start,
-				range->end, &guard_install_walk_ops, &nr_pages);
+		if (madv_behavior->lock_mode == MADVISE_VMA_READ_LOCK)
+			err = walk_page_range_vma_unsafe(madv_behavior->vma,
+					range->start, range->end, &walk_ops,
+					&nr_pages);
+		else
+			err = walk_page_range_mm_unsafe(vma->vm_mm, range->start,
+					range->end, &walk_ops, &nr_pages);
 		if (err < 0)
 			return err;
 
@@ -1193,8 +1205,7 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
 	}
 
 	/*
-	 * We were unable to install the guard pages due to being raced by page
-	 * faults. This should not happen ordinarily. We return to userspace and
+	 * We were unable to install the guard pages; return to userspace and
 	 * immediately retry, relieving lock contention.
 	 */
 	return restart_syscall();
@@ -1238,17 +1249,16 @@ static int guard_remove_pte_entry(pte_t *pte, unsigned long addr,
 	return 0;
 }
 
-static const struct mm_walk_ops guard_remove_walk_ops = {
-	.pud_entry		= guard_remove_pud_entry,
-	.pmd_entry		= guard_remove_pmd_entry,
-	.pte_entry		= guard_remove_pte_entry,
-	.walk_lock		= PGWALK_RDLOCK,
-};
-
 static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
 {
 	struct vm_area_struct *vma = madv_behavior->vma;
 	struct madvise_behavior_range *range = &madv_behavior->range;
+	struct mm_walk_ops walk_ops = {
+		.pud_entry = guard_remove_pud_entry,
+		.pmd_entry = guard_remove_pmd_entry,
+		.pte_entry = guard_remove_pte_entry,
+		.walk_lock = get_walk_lock(madv_behavior->lock_mode),
+	};
 
 	/*
 	 * We're ok with removing guards in mlock()'d ranges, as this is a
@@ -1258,7 +1268,7 @@ static long madvise_guard_remove(struct madvise_behavior *madv_behavior)
 		return -EINVAL;
 
 	return walk_page_range_vma(vma, range->start, range->end,
-			       &guard_remove_walk_ops, NULL);
+				   &walk_ops, NULL);
 }
 
 #ifdef CONFIG_64BIT
@@ -1571,6 +1581,47 @@ static bool process_madvise_remote_valid(int behavior)
 	}
 }
 
+/* Does this operation invoke anon_vma_prepare()? */
+static bool prepares_anon_vma(int behavior)
+{
+	switch (behavior) {
+	case MADV_GUARD_INSTALL:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/*
+ * We have acquired a VMA read lock; now that we have a VMA to examine, is it
+ * valid to madvise() it under the VMA read lock alone?
+ */
+static bool is_vma_lock_sufficient(struct vm_area_struct *vma,
+		struct madvise_behavior *madv_behavior)
+{
+	/* Must span only a single VMA. */
+	if (madv_behavior->range.end > vma->vm_end)
+		return false;
+	/* Remote processes unsupported. */
+	if (current->mm != vma->vm_mm)
+		return false;
+	/* Userfaultfd unsupported. */
+	if (userfaultfd_armed(vma))
+		return false;
+	/*
+	 * anon_vma_prepare() explicitly requires an mmap lock for
+	 * serialisation, so we cannot use a VMA lock in this case.
+	 *
+	 * Note we might race with anon_vma being set; however, this only makes
+	 * the check overly paranoid, which is safe.
+	 */
+	if (vma_is_anonymous(vma) &&
+	    prepares_anon_vma(madv_behavior->behavior) && !vma->anon_vma)
+		return false;
+
+	return true;
+}
+
 /*
  * Try to acquire a VMA read lock if possible.
  *
@@ -1592,15 +1643,12 @@ static bool try_vma_read_lock(struct madvise_behavior *madv_behavior)
 	vma = lock_vma_under_rcu(mm, madv_behavior->range.start);
 	if (!vma)
 		goto take_mmap_read_lock;
-	/*
-	 * Must span only a single VMA; uffd and remote processes are
-	 * unsupported.
-	 */
-	if (madv_behavior->range.end > vma->vm_end || current->mm != mm ||
-	    userfaultfd_armed(vma)) {
+
+	if (!is_vma_lock_sufficient(vma, madv_behavior)) {
 		vma_end_read(vma);
 		goto take_mmap_read_lock;
 	}
+
 	madv_behavior->vma = vma;
 	return true;
 
@@ -1713,9 +1761,9 @@ static enum madvise_lock_mode get_lock_mode(struct madvise_behavior *madv_behavi
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
 	case MADV_COLLAPSE:
+		return MADVISE_MMAP_READ_LOCK;
 	case MADV_GUARD_INSTALL:
 	case MADV_GUARD_REMOVE:
-		return MADVISE_MMAP_READ_LOCK;
 	case MADV_DONTNEED:
 	case MADV_DONTNEED_LOCKED:
 	case MADV_FREE:
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ab29b16abd2c..9ddb10359fa9 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -694,9 +694,8 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 	return walk_pgd_range(start, end, &walk);
 }
 
-int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
-			unsigned long end, const struct mm_walk_ops *ops,
-			void *private)
+int walk_page_range_vma_unsafe(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops, void *private)
 {
 	struct mm_walk walk = {
 		.ops		= ops,
@@ -709,14 +708,22 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 		return -EINVAL;
 	if (start < vma->vm_start || end > vma->vm_end)
 		return -EINVAL;
-	if (!check_ops_safe(ops))
-		return -EINVAL;
 
 	process_mm_walk_lock(walk.mm, ops->walk_lock);
 	process_vma_walk_lock(vma, ops->walk_lock);
 	return __walk_page_range(start, end, &walk);
 }
 
+int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, const struct mm_walk_ops *ops,
+			void *private)
+{
+	if (!check_ops_safe(ops))
+		return -EINVAL;
+
+	return walk_page_range_vma_unsafe(vma, start, end, ops, private);
+}
+
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 		void *private)
 {
-- 
2.51.0




* Re: [PATCH v2 1/2] mm: rename walk_page_range_mm()
  2025-11-10 17:22 ` [PATCH v2 1/2] mm: rename walk_page_range_mm() Lorenzo Stoakes
@ 2025-11-10 18:27   ` David Hildenbrand (Red Hat)
  2025-11-11  7:56   ` Vlastimil Babka
  2025-11-11 17:12   ` Davidlohr Bueso
  2 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-10 18:27 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, linux-mm,
	linux-kernel

On 10.11.25 18:22, Lorenzo Stoakes wrote:
> Make it clear we're referencing an unsafe variant of this function
> explicitly.
> 
> This is laying the foundation for exposing more such functions and
> maintaining a consistent naming scheme.
> 
> As a part of this change, rename check_ops_valid() to check_ops_safe() for
> consistency.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

-- 
Cheers

David



* Re: [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock
  2025-11-10 17:22 ` [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock Lorenzo Stoakes
@ 2025-11-10 18:29   ` David Hildenbrand (Red Hat)
  2025-11-11  8:12   ` Vlastimil Babka
  2025-11-11 17:26   ` Davidlohr Bueso
  2 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-10 18:29 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, linux-mm,
	linux-kernel

On 10.11.25 18:22, Lorenzo Stoakes wrote:
> We only need to keep the page table stable so we can perform this operation
> under the VMA lock. PTE installation is stabilised via the PTE lock.
> 
> One caveat is that, if we prepare vma->anon_vma we must hold the mmap read
> lock. We can account for this by adapting the VMA locking logic to
> explicitly check for this case and prevent a VMA lock from being acquired
> should it be the case.
> 
> This check is safe, as while we might be raced on anon_vma installation,
> this would simply make the check conservative, there's no way for us to see
> an anon_vma and then for it to be cleared, as doing so requires the
> mmap/VMA write lock.
> 
> We abstract the VMA lock validity logic to is_vma_lock_sufficient() for
> this purpose, and add prepares_anon_vma() to abstract the anon_vma logic.
> 
> In order to do this we need to have a way of installing page tables
> explicitly for an identified VMA, so we export walk_page_range_vma() in an
> unsafe variant - walk_page_range_vma_unsafe() and use this should the VMA
> read lock be taken.
> 
> We additionally update the comments in madvise_guard_install() to more
> accurately reflect the cases in which the logic may be reattempted,
> specifically THP huge pages being present.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---

Looks good to me

Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

-- 
Cheers

David



* Re: [PATCH v2 1/2] mm: rename walk_page_range_mm()
  2025-11-10 17:22 ` [PATCH v2 1/2] mm: rename walk_page_range_mm() Lorenzo Stoakes
  2025-11-10 18:27   ` David Hildenbrand (Red Hat)
@ 2025-11-11  7:56   ` Vlastimil Babka
  2025-11-11 17:12   ` Davidlohr Bueso
  2 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2025-11-11  7:56 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, linux-mm,
	linux-kernel

On 11/10/25 18:22, Lorenzo Stoakes wrote:
> Make it clear we're referencing an unsafe variant of this function
> explicitly.
> 
> This is laying the foundation for exposing more such functions and
> maintaining a consistent naming scheme.
> 
> As a part of this change, rename check_ops_valid() to check_ops_safe() for
> consistency.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/internal.h |  2 +-
>  mm/madvise.c  |  4 ++--
>  mm/pagewalk.c | 22 +++++++++++-----------
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 0af87f6c2889..479234b39394 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1652,7 +1652,7 @@ static inline void accept_page(struct page *page)
>  #endif /* CONFIG_UNACCEPTED_MEMORY */
>  
>  /* pagewalk.c */
> -int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> +int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
>  		unsigned long end, const struct mm_walk_ops *ops,
>  		void *private);
>  int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
> diff --git a/mm/madvise.c b/mm/madvise.c
> index de918b107cfc..7b938ff44be2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1171,8 +1171,8 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
>  		unsigned long nr_pages = 0;
>  
>  		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
> -		err = walk_page_range_mm(vma->vm_mm, range->start, range->end,
> -					 &guard_install_walk_ops, &nr_pages);
> +		err = walk_page_range_mm_unsafe(vma->vm_mm, range->start,
> +				range->end, &guard_install_walk_ops, &nr_pages);
>  		if (err < 0)
>  			return err;
>  
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 6cace2c8814a..ab29b16abd2c 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -452,7 +452,7 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>   * We usually restrict the ability to install PTEs, but this functionality is
>   * available to internal memory management code and provided in mm/internal.h.
>   */
> -int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> +int walk_page_range_mm_unsafe(struct mm_struct *mm, unsigned long start,
>  		unsigned long end, const struct mm_walk_ops *ops,
>  		void *private)
>  {
> @@ -518,10 +518,10 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>   * This check is performed on all functions which are parameterised by walk
>   * operations and exposed in include/linux/pagewalk.h.
>   *
> - * Internal memory management code can use the walk_page_range_mm() function to
> - * be able to use all page walking operations.
> + * Internal memory management code can use *_unsafe() functions to be able to
> + * use all page walking operations.
>   */
> -static bool check_ops_valid(const struct mm_walk_ops *ops)
> +static bool check_ops_safe(const struct mm_walk_ops *ops)
>  {
>  	/*
>  	 * The installation of PTEs is solely under the control of memory
> @@ -579,10 +579,10 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>  		unsigned long end, const struct mm_walk_ops *ops,
>  		void *private)
>  {
> -	if (!check_ops_valid(ops))
> +	if (!check_ops_safe(ops))
>  		return -EINVAL;
>  
> -	return walk_page_range_mm(mm, start, end, ops, private);
> +	return walk_page_range_mm_unsafe(mm, start, end, ops, private);
>  }
>  
>  /**
> @@ -639,7 +639,7 @@ int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end
>  
>  	if (start >= end)
>  		return -EINVAL;
> -	if (!check_ops_valid(ops))
> +	if (!check_ops_safe(ops))
>  		return -EINVAL;
>  
>  	return walk_pgd_range(start, end, &walk);
> @@ -678,7 +678,7 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
>  						    pgd, private);
>  	if (start >= end || !walk.mm)
>  		return -EINVAL;
> -	if (!check_ops_valid(ops))
> +	if (!check_ops_safe(ops))
>  		return -EINVAL;
>  
>  	/*
> @@ -709,7 +709,7 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>  		return -EINVAL;
>  	if (start < vma->vm_start || end > vma->vm_end)
>  		return -EINVAL;
> -	if (!check_ops_valid(ops))
> +	if (!check_ops_safe(ops))
>  		return -EINVAL;
>  
>  	process_mm_walk_lock(walk.mm, ops->walk_lock);
> @@ -729,7 +729,7 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>  
>  	if (!walk.mm)
>  		return -EINVAL;
> -	if (!check_ops_valid(ops))
> +	if (!check_ops_safe(ops))
>  		return -EINVAL;
>  
>  	process_mm_walk_lock(walk.mm, ops->walk_lock);
> @@ -780,7 +780,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
>  	unsigned long start_addr, end_addr;
>  	int err = 0;
>  
> -	if (!check_ops_valid(ops))
> +	if (!check_ops_safe(ops))
>  		return -EINVAL;
>  
>  	lockdep_assert_held(&mapping->i_mmap_rwsem);




* Re: [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock
  2025-11-10 17:22 ` [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock Lorenzo Stoakes
  2025-11-10 18:29   ` David Hildenbrand (Red Hat)
@ 2025-11-11  8:12   ` Vlastimil Babka
  2025-11-11 17:26   ` Davidlohr Bueso
  2 siblings, 0 replies; 9+ messages in thread
From: Vlastimil Babka @ 2025-11-11  8:12 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: David Hildenbrand, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, linux-mm,
	linux-kernel

On 11/10/25 18:22, Lorenzo Stoakes wrote:
> We only need to keep the page table stable so we can perform this operation
> under the VMA lock. PTE installation is stabilised via the PTE lock.
> 
> One caveat is that, if we prepare vma->anon_vma we must hold the mmap read
> lock. We can account for this by adapting the VMA locking logic to
> explicitly check for this case and prevent a VMA lock from being acquired
> should it be the case.
> 
> This check is safe, as while we might be raced on anon_vma installation,
> this would simply make the check conservative, there's no way for us to see
> an anon_vma and then for it to be cleared, as doing so requires the
> mmap/VMA write lock.
> 
> We abstract the VMA lock validity logic to is_vma_lock_sufficient() for
> this purpose, and add prepares_anon_vma() to abstract the anon_vma logic.
> 
> In order to do this we need to have a way of installing page tables
> explicitly for an identified VMA, so we export walk_page_range_vma() in an
> unsafe variant - walk_page_range_vma_unsafe() and use this should the VMA
> read lock be taken.
> 
> We additionally update the comments in madvise_guard_install() to more
> accurately reflect the cases in which the logic may be reattempted,
> specifically THP huge pages being present.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>




* Re: [PATCH v2 1/2] mm: rename walk_page_range_mm()
  2025-11-10 17:22 ` [PATCH v2 1/2] mm: rename walk_page_range_mm() Lorenzo Stoakes
  2025-11-10 18:27   ` David Hildenbrand (Red Hat)
  2025-11-11  7:56   ` Vlastimil Babka
@ 2025-11-11 17:12   ` Davidlohr Bueso
  2 siblings, 0 replies; 9+ messages in thread
From: Davidlohr Bueso @ 2025-11-11 17:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, linux-mm, linux-kernel

On Mon, 10 Nov 2025, Lorenzo Stoakes wrote:

>Make it clear we're referencing an unsafe variant of this function
>explicitly.
>
>This is laying the foundation for exposing more such functions and
>maintaining a consistent naming scheme.
>
>As a part of this change, rename check_ops_valid() to check_ops_safe() for
>consistency.
>
>Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>



* Re: [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock
  2025-11-10 17:22 ` [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock Lorenzo Stoakes
  2025-11-10 18:29   ` David Hildenbrand (Red Hat)
  2025-11-11  8:12   ` Vlastimil Babka
@ 2025-11-11 17:26   ` Davidlohr Bueso
  2 siblings, 0 replies; 9+ messages in thread
From: Davidlohr Bueso @ 2025-11-11 17:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, linux-mm, linux-kernel

On Mon, 10 Nov 2025, Lorenzo Stoakes wrote:

>We only need to keep the page table stable so we can perform this operation
>under the VMA lock. PTE installation is stabilised via the PTE lock.
>
>One caveat is that, if we prepare vma->anon_vma we must hold the mmap read
>lock. We can account for this by adapting the VMA locking logic to
>explicitly check for this case and prevent a VMA lock from being acquired
>should it be the case.
>
>This check is safe, as while we might be raced on anon_vma installation,
>this would simply make the check conservative, there's no way for us to see
>an anon_vma and then for it to be cleared, as doing so requires the
>mmap/VMA write lock.
>
>We abstract the VMA lock validity logic to is_vma_lock_sufficient() for
>this purpose, and add prepares_anon_vma() to abstract the anon_vma logic.
>
>In order to do this we need to have a way of installing page tables
>explicitly for an identified VMA, so we export walk_page_range_vma() in an
>unsafe variant - walk_page_range_vma_unsafe() and use this should the VMA
>read lock be taken.
>
>We additionally update the comments in madvise_guard_install() to more
>accurately reflect the cases in which the logic may be reattempted,
>specifically THP huge pages being present.

Makes sense.

Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>



end of thread, other threads:[~2025-11-11 17:28 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
2025-11-10 17:22 [PATCH v2 0/2] mm: perform guard region install/remove under VMA lock Lorenzo Stoakes
2025-11-10 17:22 ` [PATCH v2 1/2] mm: rename walk_page_range_mm() Lorenzo Stoakes
2025-11-10 18:27   ` David Hildenbrand (Red Hat)
2025-11-11  7:56   ` Vlastimil Babka
2025-11-11 17:12   ` Davidlohr Bueso
2025-11-10 17:22 ` [PATCH v2 2/2] mm/madvise: allow guard page install/remove under VMA lock Lorenzo Stoakes
2025-11-10 18:29   ` David Hildenbrand (Red Hat)
2025-11-11  8:12   ` Vlastimil Babka
2025-11-11 17:26   ` Davidlohr Bueso
