linux-mm.kvack.org archive mirror
* [PATCH v2 0/5] Fix lazy mmu mode
@ 2025-03-03 14:15 Ryan Roberts
  2025-03-03 14:15 ` [PATCH v2 1/5] mm: Fix lazy mmu docs and usage Ryan Roberts
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-03-03 14:15 UTC (permalink / raw)
  To: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas
  Cc: Ryan Roberts, linux-mm, sparclinux, xen-devel, linux-kernel

Hi All,

I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
walkers. While lazy mmu mode is already used for kernel mappings in a few
places, this will extend its use significantly.

Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
it looks like there are a bunch of bugs, some of which may be more likely to
trigger once I extend the use of lazy mmu. So this series attempts to clarify
the requirements and fix all the bugs in advance of that series. See patch #1
commit log for all the details.

Note that I have only been able to compile test these changes but I think they
are in good enough shape for some linux-next testing.

Applies on Friday's mm-unstable (5f089a9aa987), as I assume this would be
preferred via that tree.

Changes since v1
================
  - split v1 patch #1 into v2 patch #1 and #2; per David
  - Added Acked-by tags from David and Andreas; Thanks!
  - Refined which patches are truly fixes and added stable to cc

Thanks,
Ryan

Ryan Roberts (5):
  mm: Fix lazy mmu docs and usage
  fs/proc/task_mmu: Reduce scope of lazy mmu region
  sparc/mm: Disable preemption in lazy mmu mode
  sparc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
  Revert "x86/xen: allow nesting of same lazy mode"

 arch/sparc/include/asm/pgtable_64.h   |  2 --
 arch/sparc/mm/tlb.c                   |  5 ++++-
 arch/x86/include/asm/xen/hypervisor.h | 15 ++-------------
 arch/x86/xen/enlighten_pv.c           |  1 -
 fs/proc/task_mmu.c                    | 11 ++++-------
 include/linux/pgtable.h               | 14 ++++++++------
 6 files changed, 18 insertions(+), 30 deletions(-)

--
2.43.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/5] mm: Fix lazy mmu docs and usage
  2025-03-03 14:15 [PATCH v2 0/5] Fix lazy mmu mode Ryan Roberts
@ 2025-03-03 14:15 ` Ryan Roberts
  2025-03-03 14:15 ` [PATCH v2 2/5] fs/proc/task_mmu: Reduce scope of lazy mmu region Ryan Roberts
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-03-03 14:15 UTC (permalink / raw)
  To: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas
  Cc: Ryan Roberts, linux-mm, sparclinux, xen-devel, linux-kernel,
	stable, David Hildenbrand

The docs, implementations and use of arch_[enter|leave]_lazy_mmu_mode()
is a bit of a mess (to put it politely). There are a number of issues
related to nesting of lazy mmu regions and confusion over whether the
task, when in a lazy mmu region, is preemptible or not. Fix all the
issues relating to the core-mm. Follow-up commits will fix the
arch-specific implementations. Three arches implement lazy mmu: powerpc,
sparc and x86.

When arch_[enter|leave]_lazy_mmu_mode() was first introduced by commit
6606c3e0da53 ("[PATCH] paravirt: lazy mmu mode hooks.patch"), it was
expected that lazy mmu regions would never nest and that the appropriate
page table lock(s) would be held while in the region, thus ensuring the
region is non-preemptible. Additionally lazy mmu regions were only used
during manipulation of user mappings.

Commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy
updates") started invoking the lazy mmu mode in apply_to_pte_range(),
which is used for both user and kernel mappings. For kernel mappings the
region is no longer protected by any lock so there is no longer any
guarantee about non-preemptibility. Additionally, for RT configs,
holding the PTL only implies no CPU migration; it doesn't prevent
preemption.

Commit bcc6cc832573 ("mm: add default definition of set_ptes()") added
arch_[enter|leave]_lazy_mmu_mode() to the default implementation of
set_ptes(), used by x86. So after this commit, lazy mmu regions can be
nested. Additionally commit 1a10a44dfc1d ("sparc64: implement the new
page table range API") and commit 9fee28baa601 ("powerpc: implement the
new page table range API") did the same for the sparc and powerpc
set_ptes() overrides.

powerpc couldn't deal with preemption so avoids it in commit
b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu mode"),
which explicitly disables preemption for the whole region in its
implementation. x86 can support preemption (or at least it could until
it tried to add support for nesting; more on this below). Sparc looks to be
totally broken in the face of preemption, as far as I can tell.

powerpc can't deal with nesting, so avoids it in commit 47b8def9358c
("powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes"),
which removes the lazy mmu calls from its implementation of set_ptes().
x86 attempted to support nesting in commit 49147beb0ccb ("x86/xen: allow
nesting of same lazy mode") but as far as I can tell, this breaks its
support for preemption.

In short, it's all a mess; the semantics for
arch_[enter|leave]_lazy_mmu_mode() are not clearly defined and as a
result the implementations all have different expectations, sticking
plasters and bugs.

arm64 is aiming to start using these hooks, so let's clean everything up
before adding an arm64 implementation. Update the documentation to state
that lazy mmu regions can never be nested, must not be called in
interrupt context and preemption may or may not be enabled for the
duration of the region. And fix the generic implementation of set_ptes()
to avoid nesting.

Arch-specific fixes to conform to the new spec will follow this one.

These issues were spotted by code review and I have no evidence of
issues being reported in the wild.
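As a userspace sketch of the intended contract (an illustrative model only;
the function names and flags below are invented, not kernel code), the
difference between the old nesting-prone generic set_ptes() and the fixed
one looks like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative userspace model of the clarified contract (not kernel
 * code): lazy mmu regions must not nest, so entering while already in
 * the mode is a bug. The flag stands in for arch-private state. */
static bool in_lazy_mmu;
static int nesting_bugs;

static void enter_lazy_mmu_model(void)
{
	if (in_lazy_mmu)
		nesting_bugs++;		/* a BUG_ON() candidate in real code */
	in_lazy_mmu = true;
}

static void leave_lazy_mmu_model(void)
{
	in_lazy_mmu = false;		/* arch would flush batched updates here */
}

/* Old generic set_ptes(): entered the mode itself, so calling it from
 * within an existing lazy region nested. */
static void old_set_ptes_model(void)
{
	enter_lazy_mmu_model();
	/* ... set_pte() loop ... */
	leave_lazy_mmu_model();
}

/* Fixed generic set_ptes(): relies on the caller's region, if any. */
static void fixed_set_ptes_model(void)
{
	/* ... set_pte() loop only ... */
}
```

Calling the old model inside a region trips the nesting check once; the
fixed model never does.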

Cc: <stable@vger.kernel.org>
Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()")
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/pgtable.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d02372..787c632ee2c9 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -222,10 +222,14 @@ static inline int pmd_dirty(pmd_t pmd)
  * hazard could result in the direct mode hypervisor case, since the actual
  * write to the page tables may not yet have taken place, so reads though
  * a raw PTE pointer after it has been modified are not guaranteed to be
- * up to date.  This mode can only be entered and left under the protection of
- * the page table locks for all page tables which may be modified.  In the UP
- * case, this is required so that preemption is disabled, and in the SMP case,
- * it must synchronize the delayed page table writes properly on other CPUs.
+ * up to date.
+ *
+ * In the general case, no lock is guaranteed to be held between entry and exit
+ * of the lazy mode. So the implementation must assume preemption may be enabled
+ * and cpu migration is possible; it must take steps to be robust against this.
+ * (In practice, for user PTE updates, the appropriate page table lock(s) are
+ * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
+ * and the mode cannot be used in interrupt context.
  */
 #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 #define arch_enter_lazy_mmu_mode()	do {} while (0)
@@ -287,7 +291,6 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 {
 	page_table_check_ptes_set(mm, ptep, pte, nr);
 
-	arch_enter_lazy_mmu_mode();
 	for (;;) {
 		set_pte(ptep, pte);
 		if (--nr == 0)
@@ -295,7 +298,6 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		ptep++;
 		pte = pte_next_pfn(pte);
 	}
-	arch_leave_lazy_mmu_mode();
 }
 #endif
 #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-- 
2.43.0




* [PATCH v2 2/5] fs/proc/task_mmu: Reduce scope of lazy mmu region
  2025-03-03 14:15 [PATCH v2 0/5] Fix lazy mmu mode Ryan Roberts
  2025-03-03 14:15 ` [PATCH v2 1/5] mm: Fix lazy mmu docs and usage Ryan Roberts
@ 2025-03-03 14:15 ` Ryan Roberts
  2025-03-03 14:15 ` [PATCH v2 3/5] sparc/mm: Disable preemption in lazy mmu mode Ryan Roberts
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-03-03 14:15 UTC (permalink / raw)
  To: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas
  Cc: Ryan Roberts, linux-mm, sparclinux, xen-devel, linux-kernel,
	David Hildenbrand

Update the way arch_[enter|leave]_lazy_mmu_mode() is called in
pagemap_scan_pmd_entry() to follow the normal pattern of holding the ptl
for user space mappings. As a result the scope is reduced to only the
pte table, but that's where most of the performance win is.

While I believe there wasn't technically a bug here, the original scope
made it easier to accidentally nest or, worse, to accidentally call
something like kmap(), which expects an immediate mode pte modification
but would end up with it deferred.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 fs/proc/task_mmu.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c17615e21a5d..b0f189815512 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2459,22 +2459,19 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 	spinlock_t *ptl;
 	int ret;
 
-	arch_enter_lazy_mmu_mode();
-
 	ret = pagemap_scan_thp_entry(pmd, start, end, walk);
-	if (ret != -ENOENT) {
-		arch_leave_lazy_mmu_mode();
+	if (ret != -ENOENT)
 		return ret;
-	}
 
 	ret = 0;
 	start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
 	if (!pte) {
-		arch_leave_lazy_mmu_mode();
 		walk->action = ACTION_AGAIN;
 		return 0;
 	}
 
+	arch_enter_lazy_mmu_mode();
+
 	if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
 		/* Fast path for performing exclusive WP */
 		for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
@@ -2543,8 +2540,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 	if (flush_end)
 		flush_tlb_range(vma, start, addr);
 
-	pte_unmap_unlock(start_pte, ptl);
 	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(start_pte, ptl);
 
 	cond_resched();
 	return ret;
-- 
2.43.0




* [PATCH v2 3/5] sparc/mm: Disable preemption in lazy mmu mode
  2025-03-03 14:15 [PATCH v2 0/5] Fix lazy mmu mode Ryan Roberts
  2025-03-03 14:15 ` [PATCH v2 1/5] mm: Fix lazy mmu docs and usage Ryan Roberts
  2025-03-03 14:15 ` [PATCH v2 2/5] fs/proc/task_mmu: Reduce scope of lazy mmu region Ryan Roberts
@ 2025-03-03 14:15 ` Ryan Roberts
  2025-03-03 14:15 ` [PATCH v2 4/5] sparc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes Ryan Roberts
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-03-03 14:15 UTC (permalink / raw)
  To: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas
  Cc: Ryan Roberts, linux-mm, sparclinux, xen-devel, linux-kernel,
	stable, David Hildenbrand

Since commit 38e0edb15bd0 ("mm/apply_to_range: call pte function with
lazy updates") it's been possible for arch_[enter|leave]_lazy_mmu_mode()
to be called without holding a page table lock (for the kernel mappings
case), and therefore it is possible that preemption may occur while in
the lazy mmu mode. The Sparc lazy mmu implementation is not robust to
preemption since it stores the lazy mode state in a per-cpu structure
and does not attempt to manage that state on task switch.

Powerpc had the same issue and fixed it by explicitly disabling
preemption in arch_enter_lazy_mmu_mode() and re-enabling in
arch_leave_lazy_mmu_mode(). See commit b9ef323ea168 ("powerpc/64s:
Disable preemption in hash lazy mmu mode").

Given Sparc's lazy mmu mode is based on powerpc's, let's fix it in the
same way here.
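To illustrate the failure mode, here is a hedged userspace model (all
names are invented; the array merely stands in for sparc's per-cpu
tlb_batch state, and maybe_migrate() models a preemption point):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model (not the real sparc code) of why per-cpu lazy-mmu
 * state breaks if the task can migrate between enter and leave. */
#define NR_CPUS 2
static bool tlb_batch_active[NR_CPUS];	/* stands in for tlb_batch.active */
static int current_cpu;
static int preempt_count;

static void preempt_disable_model(void) { preempt_count++; }
static void preempt_enable_model(void)  { preempt_count--; }

/* A preemption point: the scheduler may migrate us unless preemption
 * is disabled. */
static void maybe_migrate(int target_cpu)
{
	if (preempt_count == 0)
		current_cpu = target_cpu;
}

/* Buggy pre-fix behaviour: no preemption protection, so enter and
 * leave may touch different CPUs' state. */
static void buggy_enter(void) { tlb_batch_active[current_cpu] = true; }
static void buggy_leave(void) { tlb_batch_active[current_cpu] = false; }

/* Fixed behaviour, mirroring the approach of commit b9ef323ea168. */
static void fixed_enter(void)
{
	preempt_disable_model();
	tlb_batch_active[current_cpu] = true;
}

static void fixed_leave(void)
{
	tlb_batch_active[current_cpu] = false;
	preempt_enable_model();
}
```

With the buggy pair, a migration between enter and leave strands the
original CPU's batch in the active state; with the fixed pair, migration
is simply prevented for the duration.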

Cc: <stable@vger.kernel.org>
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/sparc/mm/tlb.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 8648a50afe88..a35ddcca5e76 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -52,8 +52,10 @@ void flush_tlb_pending(void)
 
 void arch_enter_lazy_mmu_mode(void)
 {
-	struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
+	struct tlb_batch *tb;
 
+	preempt_disable();
+	tb = this_cpu_ptr(&tlb_batch);
 	tb->active = 1;
 }
 
@@ -64,6 +66,7 @@ void arch_leave_lazy_mmu_mode(void)
 	if (tb->tlb_nr)
 		flush_tlb_pending();
 	tb->active = 0;
+	preempt_enable();
 }
 
 static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,
-- 
2.43.0




* [PATCH v2 4/5] sparc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
  2025-03-03 14:15 [PATCH v2 0/5] Fix lazy mmu mode Ryan Roberts
                   ` (2 preceding siblings ...)
  2025-03-03 14:15 ` [PATCH v2 3/5] sparc/mm: Disable preemption in lazy mmu mode Ryan Roberts
@ 2025-03-03 14:15 ` Ryan Roberts
  2025-03-03 14:15 ` [PATCH v2 5/5] Revert "x86/xen: allow nesting of same lazy mode" Ryan Roberts
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-03-03 14:15 UTC (permalink / raw)
  To: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas
  Cc: Ryan Roberts, linux-mm, sparclinux, xen-devel, linux-kernel,
	stable, David Hildenbrand

With commit 1a10a44dfc1d ("sparc64: implement the new page table range
API") set_ptes was added to the sparc architecture. The implementation
included calls to arch_enter/leave_lazy_mmu().

The patch removes the usage of arch_enter/leave_lazy_mmu() since this
implies nesting of lazy mmu regions which is not supported. Without this
fix, lazy mmu mode is effectively disabled because we exit the mode
after the first set_ptes:

remap_pte_range()
  -> arch_enter_lazy_mmu()
  -> set_ptes()
      -> arch_enter_lazy_mmu()
      -> arch_leave_lazy_mmu()
  -> arch_leave_lazy_mmu()

Powerpc suffered the same problem and fixed it in a corresponding way
with commit 47b8def9358c ("powerpc/mm: Avoid calling
arch_enter/leave_lazy_mmu() in set_ptes").
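The premature-exit effect of the trace above can be modelled in plain C
(hypothetical names throughout; the counters stand in for batched versus
immediate PTE writes):

```c
#include <assert.h>

/* Userspace model of the call trace above: the inner set_ptes()
 * enter/leave pair terminates the caller's lazy region early, so PTE
 * updates after set_ptes() returns are no longer batched. */
static int lazy_active;		/* arches track a flag, not a depth */
static int updates_batched;
static int updates_immediate;

static void enter_model(void) { lazy_active = 1; }
static void leave_model(void) { lazy_active = 0; }	/* flush here */

static void set_pte_model(void)
{
	if (lazy_active)
		updates_batched++;
	else
		updates_immediate++;
}

/* sparc's old set_ptes(): wrapped its loop in enter/leave, ending the
 * caller's region too. */
static void old_set_ptes(int nr)
{
	enter_model();
	while (nr--)
		set_pte_model();
	leave_model();
}

/* Fixed set_ptes(): batches under whatever region the caller holds. */
static void fixed_set_ptes(int nr)
{
	while (nr--)
		set_pte_model();
}
```

With the old version, a PTE update made by the caller after set_ptes()
returns goes through in immediate mode even though the caller's lazy
region is conceptually still open.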

Cc: <stable@vger.kernel.org>
Fixes: 1a10a44dfc1d ("sparc64: implement the new page table range API")
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/sparc/include/asm/pgtable_64.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 2b7f358762c1..dc28f2c4eee3 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -936,7 +936,6 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pte, unsigned int nr)
 {
-	arch_enter_lazy_mmu_mode();
 	for (;;) {
 		__set_pte_at(mm, addr, ptep, pte, 0);
 		if (--nr == 0)
@@ -945,7 +944,6 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_val(pte) += PAGE_SIZE;
 		addr += PAGE_SIZE;
 	}
-	arch_leave_lazy_mmu_mode();
 }
 #define set_ptes set_ptes
 
-- 
2.43.0




* [PATCH v2 5/5] Revert "x86/xen: allow nesting of same lazy mode"
  2025-03-03 14:15 [PATCH v2 0/5] Fix lazy mmu mode Ryan Roberts
                   ` (3 preceding siblings ...)
  2025-03-03 14:15 ` [PATCH v2 4/5] sparc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes Ryan Roberts
@ 2025-03-03 14:15 ` Ryan Roberts
  2025-03-03 14:36 ` [PATCH v2 0/5] Fix lazy mmu mode Jürgen Groß
  2025-04-10 16:07 ` Alexander Gordeev
  6 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-03-03 14:15 UTC (permalink / raw)
  To: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas
  Cc: Ryan Roberts, linux-mm, sparclinux, xen-devel, linux-kernel,
	David Hildenbrand

Commit 49147beb0ccb ("x86/xen: allow nesting of same lazy mode") was
added as a solution for a core-mm code change where
arch_[enter|leave]_lazy_mmu_mode() started to be called in a nested
manner; see commit bcc6cc832573 ("mm: add default definition of
set_ptes()").

However, now that we have fixed the API to avoid nesting, we no longer
need this capability in the x86 implementation.

Additionally, from code review, I don't believe the fix was ever robust
in the case of preemption occurring while in the nested lazy mode. The
implementation usually deals with preemption by calling
arch_leave_lazy_mmu_mode() from xen_start_context_switch() for the
outgoing task if we are in the lazy mmu mode. Then in
xen_end_context_switch(), it restarts the lazy mode by calling
arch_enter_lazy_mmu_mode() for an incoming task that was in the lazy
mode when it was switched out. But arch_leave_lazy_mmu_mode() will only
unwind a single level of nesting. If we are in the double nest, then
it's not fully unwound and per-cpu variables are left in a bad state.

So the correct solution is to remove the possibility of nesting from the
higher level (which has now been done) and remove this x86-specific
solution.
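A userspace model of the reverted scheme (the names echo the kernel's,
but this is an illustration rather than the real per-cpu implementation)
shows the single-level unwind leaving stale state behind:

```c
#include <assert.h>

/* Userspace model of the reverted xen scheme: a nesting counter plus a
 * mode, with the context-switch path only unwinding one level. */
enum mode { LAZY_NONE, LAZY_MMU };
static enum mode xen_lazy_mode = LAZY_NONE;
static unsigned int xen_lazy_nesting;

static void enter_lazy(void)
{
	if (xen_lazy_mode == LAZY_MMU) {
		xen_lazy_nesting++;
		return;
	}
	xen_lazy_mode = LAZY_MMU;
}

static void leave_lazy(void)
{
	if (xen_lazy_nesting == 0)
		xen_lazy_mode = LAZY_NONE;
	else
		xen_lazy_nesting--;
}

/* Model of xen_start_context_switch(): leaves the mode once for the
 * outgoing task, i.e. pops only one nesting level. */
static void start_context_switch(void)
{
	if (xen_lazy_mode == LAZY_MMU)
		leave_lazy();
}
```

If the task is switched out while double-nested, the model's per-cpu
mode still reads LAZY_MMU after the "switch", exactly the bad state
described above.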

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/x86/include/asm/xen/hypervisor.h | 15 ++-------------
 arch/x86/xen/enlighten_pv.c           |  1 -
 2 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/xen/hypervisor.h b/arch/x86/include/asm/xen/hypervisor.h
index a9088250770f..bd0fc69a10a7 100644
--- a/arch/x86/include/asm/xen/hypervisor.h
+++ b/arch/x86/include/asm/xen/hypervisor.h
@@ -72,18 +72,10 @@ enum xen_lazy_mode {
 };
 
 DECLARE_PER_CPU(enum xen_lazy_mode, xen_lazy_mode);
-DECLARE_PER_CPU(unsigned int, xen_lazy_nesting);
 
 static inline void enter_lazy(enum xen_lazy_mode mode)
 {
-	enum xen_lazy_mode old_mode = this_cpu_read(xen_lazy_mode);
-
-	if (mode == old_mode) {
-		this_cpu_inc(xen_lazy_nesting);
-		return;
-	}
-
-	BUG_ON(old_mode != XEN_LAZY_NONE);
+	BUG_ON(this_cpu_read(xen_lazy_mode) != XEN_LAZY_NONE);
 
 	this_cpu_write(xen_lazy_mode, mode);
 }
@@ -92,10 +84,7 @@ static inline void leave_lazy(enum xen_lazy_mode mode)
 {
 	BUG_ON(this_cpu_read(xen_lazy_mode) != mode);
 
-	if (this_cpu_read(xen_lazy_nesting) == 0)
-		this_cpu_write(xen_lazy_mode, XEN_LAZY_NONE);
-	else
-		this_cpu_dec(xen_lazy_nesting);
+	this_cpu_write(xen_lazy_mode, XEN_LAZY_NONE);
 }
 
 enum xen_lazy_mode xen_get_lazy_mode(void);
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 5e57835e999d..919e4df9380b 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -99,7 +99,6 @@ struct tls_descs {
 };
 
 DEFINE_PER_CPU(enum xen_lazy_mode, xen_lazy_mode) = XEN_LAZY_NONE;
-DEFINE_PER_CPU(unsigned int, xen_lazy_nesting);
 
 enum xen_lazy_mode xen_get_lazy_mode(void)
 {
-- 
2.43.0




* Re: [PATCH v2 0/5] Fix lazy mmu mode
  2025-03-03 14:15 [PATCH v2 0/5] Fix lazy mmu mode Ryan Roberts
                   ` (4 preceding siblings ...)
  2025-03-03 14:15 ` [PATCH v2 5/5] Revert "x86/xen: allow nesting of same lazy mode" Ryan Roberts
@ 2025-03-03 14:36 ` Jürgen Groß
  2025-04-10 16:07 ` Alexander Gordeev
  6 siblings, 0 replies; 11+ messages in thread
From: Jürgen Groß @ 2025-03-03 14:36 UTC (permalink / raw)
  To: Ryan Roberts, Andrew Morton, David S. Miller, Andreas Larsson,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas
  Cc: linux-mm, sparclinux, xen-devel, linux-kernel



On 03.03.25 15:15, Ryan Roberts wrote:
> Hi All,
> 
> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
> walkers. While lazy mmu mode is already used for kernel mappings in a few
> places, this will extend its use significantly.
> 
> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
> it looks like there are a bunch of bugs, some of which may be more likely to
> trigger once I extend the use of lazy mmu. So this series attempts to clarify
> the requirements and fix all the bugs in advance of that series. See patch #1
> commit log for all the details.
> 
> Note that I have only been able to compile test these changes but I think they
> are in good enough shape for some linux-next testing.
> 
> Applies on Friday's mm-unstable (5f089a9aa987), as I assume this would be
> preferred via that tree.
> 
> Changes since v1
> ================
>    - split v1 patch #1 into v2 patch #1 and #2; per David
>    - Added Acked-by tags from David and Andreas; Thanks!
>    - Refined which patches are truly fixes and added stable to cc
> 
> Thanks,
> Ryan
> 
> Ryan Roberts (5):
>    mm: Fix lazy mmu docs and usage
>    fs/proc/task_mmu: Reduce scope of lazy mmu region
>    sparc/mm: Disable preemption in lazy mmu mode
>    sparc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
>    Revert "x86/xen: allow nesting of same lazy mode"
> 
>   arch/sparc/include/asm/pgtable_64.h   |  2 --
>   arch/sparc/mm/tlb.c                   |  5 ++++-
>   arch/x86/include/asm/xen/hypervisor.h | 15 ++-------------
>   arch/x86/xen/enlighten_pv.c           |  1 -
>   fs/proc/task_mmu.c                    | 11 ++++-------
>   include/linux/pgtable.h               | 14 ++++++++------
>   6 files changed, 18 insertions(+), 30 deletions(-)

For the series:

Acked-by: Juergen Gross <jgross@suse.com>


Juergen



* Re: [PATCH v2 0/5] Fix lazy mmu mode
  2025-03-03 14:15 [PATCH v2 0/5] Fix lazy mmu mode Ryan Roberts
                   ` (5 preceding siblings ...)
  2025-03-03 14:36 ` [PATCH v2 0/5] Fix lazy mmu mode Jürgen Groß
@ 2025-04-10 16:07 ` Alexander Gordeev
  2025-04-14 13:22   ` Ryan Roberts
  6 siblings, 1 reply; 11+ messages in thread
From: Alexander Gordeev @ 2025-04-10 16:07 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas, linux-mm, sparclinux, xen-devel, linux-kernel

On Mon, Mar 03, 2025 at 02:15:34PM +0000, Ryan Roberts wrote:

Hi Ryan,

> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
> walkers. While lazy mmu mode is already used for kernel mappings in a few
> places, this will extend its use significantly.
> 
> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
> it looks like there are a bunch of bugs, some of which may be more likely to
> trigger once I extend the use of lazy mmu.

Do you have any idea about generic code issues as result of not adhering to
the originally stated requirement:

  /*
   ...
   * the PTE updates which happen during this window.  Note that using this
   * interface requires that read hazards be removed from the code.  A read
   * hazard could result in the direct mode hypervisor case, since the actual
   * write to the page tables may not yet have taken place, so reads though
   * a raw PTE pointer after it has been modified are not guaranteed to be
   * up to date.
   ...
   */

I tried to follow few code paths and at least this one does not look so good:

copy_pte_range(..., src_pte, ...)
	ret = copy_nonpresent_pte(..., src_pte, ...)
		try_restore_exclusive_pte(..., src_pte, ...)	// is_device_exclusive_entry(entry)
			restore_exclusive_pte(..., ptep, ...)
				set_pte_at(..., ptep, ...)
					set_pte(ptep, pte);	// save in lazy mmu mode

	// ret == -ENOENT

	ptent = ptep_get(src_pte);				// lazy mmu save is not observed
	ret = copy_present_ptes(..., ptent, ...);		// wrong ptent used

I am not aware whether the effort to "read hazards be removed from the code"
has ever been made and the generic code is safe in this regard.
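The hazard in that path can be sketched as a userspace model (all names
hypothetical; a single pending slot stands in for a hypervisor's batch
of deferred PTE writes):

```c
#include <assert.h>

/* Userspace model of the read hazard described above: while the lazy
 * mode batches PTE writes, a raw read of the PTE does not see them
 * until the batch is flushed. */
static unsigned long pte_hw;		/* the "real" page table entry */
static unsigned long pte_pending;	/* value held in the batch */
static int lazy, pending;

static void enter_model(void) { lazy = 1; }

static void leave_model(void)
{
	if (pending) {
		pte_hw = pte_pending;	/* flush the batch */
		pending = 0;
	}
	lazy = 0;
}

static void set_pte_model(unsigned long val)
{
	if (lazy) {
		pte_pending = val;	/* deferred, not yet visible */
		pending = 1;
	} else {
		pte_hw = val;
	}
}

/* ptep_get() modelled as a raw read, as on most arches. */
static unsigned long ptep_get_model(void) { return pte_hw; }
```

In the copy_pte_range() path above, the ptep_get() after the nonpresent
copy corresponds to the stale raw read: it misses the set_pte() made by
restore_exclusive_pte() inside the lazy region.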

What is your take on this?

Thanks!



* Re: [PATCH v2 0/5] Fix lazy mmu mode
  2025-04-10 16:07 ` Alexander Gordeev
@ 2025-04-14 13:22   ` Ryan Roberts
  2025-04-14 14:04     ` Alexander Gordeev
  0 siblings, 1 reply; 11+ messages in thread
From: Ryan Roberts @ 2025-04-14 13:22 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas, linux-mm, sparclinux, xen-devel, linux-kernel

On 10/04/2025 17:07, Alexander Gordeev wrote:
> On Mon, Mar 03, 2025 at 02:15:34PM +0000, Ryan Roberts wrote:
> 
> Hi Ryan,
> 
>> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
>> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
>> walkers. While lazy mmu mode is already used for kernel mappings in a few
>> places, this will extend its use significantly.
>>
>> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
>> it looks like there are a bunch of bugs, some of which may be more likely to
>> trigger once I extend the use of lazy mmu.
> 
> Do you have any idea about generic code issues as result of not adhering to
> the originally stated requirement:
> 
>   /*
>    ...
>    * the PTE updates which happen during this window.  Note that using this
>    * interface requires that read hazards be removed from the code.  A read
>    * hazard could result in the direct mode hypervisor case, since the actual
>    * write to the page tables may not yet have taken place, so reads though
>    * a raw PTE pointer after it has been modified are not guaranteed to be
>    * up to date.
>    ...
>    */
> 
> I tried to follow few code paths and at least this one does not look so good:
> 
> copy_pte_range(..., src_pte, ...)
> 	ret = copy_nonpresent_pte(..., src_pte, ...)
> 		try_restore_exclusive_pte(..., src_pte, ...)	// is_device_exclusive_entry(entry)
> 			restore_exclusive_pte(..., ptep, ...)
> 				set_pte_at(..., ptep, ...)
> 					set_pte(ptep, pte);	// save in lazy mmu mode
> 
> 	// ret == -ENOENT
> 
> 	ptent = ptep_get(src_pte);				// lazy mmu save is not observed
> 	ret = copy_present_ptes(..., ptent, ...);		// wrong ptent used
> 
> I am not aware whether the effort to "read hazards be removed from the code"
> has ever been made and the generic code is safe in this regard.
> 
> What is your take on this?

Hmm, that looks like a bug to me, at least based on the stated requirements.
Although this is not a "read through a raw PTE *pointer*", it is a ptep_get().
The arch code can override that so I guess it has an opportunity to flush. But I
don't think any arches are currently doing that.

Probably the simplest fix is to add arch_flush_lazy_mmu_mode() before the
ptep_get()?

It won't be a problem in practice for arm64, since the pgtables are always
updated immediately. I just want to use these hooks to defer/batch barriers in
certain cases.

And this is a pre-existing issue for the arches that use lazy mmu with
device-exclusive mappings, which my extending lazy mmu into vmalloc won't
exacerbate.

Would you be willing/able to submit a fix?

Thanks,
Ryan


> 
> Thanks!




* Re: [PATCH v2 0/5] Fix lazy mmu mode
  2025-04-14 13:22   ` Ryan Roberts
@ 2025-04-14 14:04     ` Alexander Gordeev
  2025-04-14 14:11       ` Ryan Roberts
  0 siblings, 1 reply; 11+ messages in thread
From: Alexander Gordeev @ 2025-04-14 14:04 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas, linux-mm, sparclinux, xen-devel, linux-kernel

On Mon, Apr 14, 2025 at 02:22:53PM +0100, Ryan Roberts wrote:
> On 10/04/2025 17:07, Alexander Gordeev wrote:
> >> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
> >> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
> >> walkers. While lazy mmu mode is already used for kernel mappings in a few
> >> places, this will extend its use significantly.
> >>
> >> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
> >> it looks like there are a bunch of bugs, some of which may be more likely to
> >> trigger once I extend the use of lazy mmu.
> > 
> > Do you have any idea about generic code issues as result of not adhering to
> > the originally stated requirement:
> > 
> >   /*
> >    ...
> >    * the PTE updates which happen during this window.  Note that using this
> >    * interface requires that read hazards be removed from the code.  A read
> >    * hazard could result in the direct mode hypervisor case, since the actual
> >    * write to the page tables may not yet have taken place, so reads though
> >    * a raw PTE pointer after it has been modified are not guaranteed to be
> >    * up to date.
> >    ...
> >    */
> > 
> > I tried to follow few code paths and at least this one does not look so good:
> > 
> > copy_pte_range(..., src_pte, ...)
> > 	ret = copy_nonpresent_pte(..., src_pte, ...)
> > 		try_restore_exclusive_pte(..., src_pte, ...)	// is_device_exclusive_entry(entry)
> > 			restore_exclusive_pte(..., ptep, ...)
> > 				set_pte_at(..., ptep, ...)
> > 					set_pte(ptep, pte);	// save in lazy mmu mode
> > 
> > 	// ret == -ENOENT
> > 
> > 	ptent = ptep_get(src_pte);				// lazy mmu save is not observed
> > 	ret = copy_present_ptes(..., ptent, ...);		// wrong ptent used
> > 
> > I am not aware whether the effort to "read hazards be removed from the code"
> > has ever been made and the generic code is safe in this regard.
> > 
> > What is your take on this?
> 
> Hmm, that looks like a bug to me, at least based on the stated requirements.
> Although this is not a "read through a raw PTE *pointer*", it is a ptep_get().
> The arch code can override that so I guess it has an opportunity to flush. But I
> don't think any arches are currently doing that.
> 
> Probably the simplest fix is to add arch_flush_lazy_mmu_mode() before the
> ptep_get()?

Which would completely defeat the very idea of the lazy mmu mode?
(As one would flush on every PTE page table iteration.)

> It won't be a problem in practice for arm64, since the pgtables are always
> updated immediately. I just want to use these hooks to defer/batch barriers in
> certain cases.
> 
> And this is a pre-existing issue for the arches that use lazy mmu with
> device-exclusive mappings, which my extending lazy mmu into vmalloc won't
> exacerbate.
> 
> Would you be willing/able to submit a fix?

Well, we have a dozen lazy mmu cases, and I would guess this is not the
only piece of code that is affected. I was thinking about a debug feature
that could help spot all the troubled locations.

Then we could assess and decide if it is feasible to fix. Just turning the
code above into the PTE read-modify-update pattern is quite an exercise...

> Thanks,
> Ryan
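For illustration, the hazard being discussed can be modelled outside the kernel.
The sketch below is a self-contained userspace toy, not kernel code: the function
names mirror the kernel's lazy mmu hooks, but the batching implementation is
invented purely to demonstrate why a raw ptep_get() can observe a stale value
until a flush happens:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy userspace model of lazy mmu batching. In lazy mode, set_pte()
 * queues the update instead of writing the "page table" directly; the
 * queue is only applied by a flush. Names mirror the kernel hooks, but
 * everything here is invented for illustration.
 */

typedef unsigned long pte_t;

#define NPTES 16
pte_t page_table[NPTES];

static struct { pte_t *ptep; pte_t val; } pending[NPTES];
static int npending;
static int lazy_mmu;

void arch_enter_lazy_mmu_mode(void) { lazy_mmu = 1; }

void arch_flush_lazy_mmu_mode(void)
{
	/* Apply all deferred updates to the real table. */
	for (int i = 0; i < npending; i++)
		*pending[i].ptep = pending[i].val;
	npending = 0;
}

void arch_leave_lazy_mmu_mode(void)
{
	arch_flush_lazy_mmu_mode();
	lazy_mmu = 0;
}

void set_pte(pte_t *ptep, pte_t val)
{
	if (lazy_mmu) {
		/* Deferred: this is the window where raw reads go stale. */
		pending[npending].ptep = ptep;
		pending[npending].val = val;
		npending++;
	} else {
		*ptep = val;
	}
}

/* A raw read, like the generic ptep_get() on arches that don't override it. */
pte_t ptep_get(pte_t *ptep) { return *ptep; }
```

Driving this model reproduces the copy_pte_range() symptom: after a set_pte()
inside the lazy section, ptep_get() returns the old value until
arch_flush_lazy_mmu_mode() runs, which is the flush-before-read fix suggested
above.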



* Re: [PATCH v2 0/5] Fix lazy mmu mode
  2025-04-14 14:04     ` Alexander Gordeev
@ 2025-04-14 14:11       ` Ryan Roberts
  0 siblings, 0 replies; 11+ messages in thread
From: Ryan Roberts @ 2025-04-14 14:11 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Andrew Morton, David S. Miller, Andreas Larsson, Juergen Gross,
	Boris Ostrovsky, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Matthew Wilcox (Oracle),
	Catalin Marinas, linux-mm, sparclinux, xen-devel, linux-kernel

On 14/04/2025 15:04, Alexander Gordeev wrote:
> On Mon, Apr 14, 2025 at 02:22:53PM +0100, Ryan Roberts wrote:
>> On 10/04/2025 17:07, Alexander Gordeev wrote:
>>>> I'm planning to implement lazy mmu mode for arm64 to optimize vmalloc. As part
>>>> of that, I will extend lazy mmu mode to cover kernel mappings in vmalloc table
>>>> walkers. While lazy mmu mode is already used for kernel mappings in a few
>>>> places, this will extend its use significantly.
>>>>
>>>> Having reviewed the existing lazy mmu implementations in powerpc, sparc and x86,
>>>> it looks like there are a bunch of bugs, some of which may be more likely to
>>>> trigger once I extend the use of lazy mmu.
>>>
>>> Do you have any idea about generic code issues as a result of not adhering to
>>> the originally stated requirement:
>>>
>>>   /*
>>>    ...
>>>    * the PTE updates which happen during this window.  Note that using this
>>>    * interface requires that read hazards be removed from the code.  A read
>>>    * hazard could result in the direct mode hypervisor case, since the actual
>>>    * write to the page tables may not yet have taken place, so reads through
>>>    * a raw PTE pointer after it has been modified are not guaranteed to be
>>>    * up to date.
>>>    ...
>>>    */
>>>
>>> I tried to follow a few code paths and at least this one does not look so good:
>>>
>>> copy_pte_range(..., src_pte, ...)
>>> 	ret = copy_nonpresent_pte(..., src_pte, ...)
>>> 		try_restore_exclusive_pte(..., src_pte, ...)	// is_device_exclusive_entry(entry)
>>> 			restore_exclusive_pte(..., ptep, ...)
>>> 				set_pte_at(..., ptep, ...)
>>> 					set_pte(ptep, pte);	// save in lazy mmu mode
>>>
>>> 	// ret == -ENOENT
>>>
>>> 	ptent = ptep_get(src_pte);				// lazy mmu save is not observed
>>> 	ret = copy_present_ptes(..., ptent, ...);		// wrong ptent used
>>>
>>> I am not aware whether the effort to remove read hazards from the code has
>>> ever been made, or whether the generic code is safe in this regard.
>>>
>>> What is your take on this?
>>
>> Hmm, that looks like a bug to me, at least based on the stated requirements.
>> Although this is not a "read through a raw PTE *pointer*", it is a ptep_get().
>> The arch code can override that so I guess it has an opportunity to flush. But I
>> don't think any arches are currently doing that.
>>
>> Probably the simplest fix is to add arch_flush_lazy_mmu_mode() before the
>> ptep_get()?
> 
> Which would completely defeat the very idea of the lazy mmu mode?
> (As one would flush on every PTE page table iteration.)

Well yes, but this is a pretty rare path, I'm guessing?

> 
>> It won't be a problem in practice for arm64, since the pgtables are always
>> updated immediately. I just want to use these hooks to defer/batch barriers in
>> certain cases.
>>
>> And this is a pre-existing issue for the arches that use lazy mmu with
>> device-exclusive mappings, which my extending lazy mmu into vmalloc won't
>> exacerbate.
>>
>> Would you be willing/able to submit a fix?
> 
> Well, we have a dozen lazy mmu cases, and I would guess this is not the
> only piece of code that is affected. I was thinking about a debug feature
> that could help spot all the troubled locations.
> 
> Then we could assess and decide if it is feasible to fix. Just turning the
> code above into the PTE read-modify-update pattern is quite an exercise...
> 
>> Thanks,
>> Ryan
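The debug feature idea floated above can be sketched in the same toy-model
style. This is purely hypothetical: the per-PTE marker array and the detection
hook are invented for illustration and are not existing kernel API, but they
show the shape such a checker could take: remember which PTEs have a deferred
update pending, and flag any raw read of one of them while lazy mode is active:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical sketch of a lazy mmu read-hazard detector. Track which
 * "PTEs" have a deferred update pending and count/report any raw read
 * of such a PTE while lazy mode is active. Invented for illustration;
 * none of this is existing kernel API.
 */

typedef unsigned long pte_t;

#define NPTES 16
pte_t page_table[NPTES];
static unsigned char lazy_pending[NPTES]; /* per-PTE pending-update marker */
int lazy_mmu;
int hazard_hits; /* how many read hazards the checker caught */

void debug_set_pte(pte_t *ptep, pte_t val)
{
	if (lazy_mmu)
		lazy_pending[ptep - page_table] = 1; /* real code would also queue val */
	*ptep = val; /* write through anyway; we only want detection here */
}

pte_t debug_ptep_get(pte_t *ptep)
{
	if (lazy_mmu && lazy_pending[ptep - page_table]) {
		hazard_hits++; /* in the kernel this could be a WARN_ON_ONCE() */
		fprintf(stderr, "lazy mmu read hazard on pte %td\n",
			ptep - page_table);
	}
	return *ptep;
}

void debug_leave_lazy_mmu_mode(void)
{
	/* Leaving lazy mode flushes, so pending markers are cleared. */
	for (int i = 0; i < NPTES; i++)
		lazy_pending[i] = 0;
	lazy_mmu = 0;
}
```

Run against a copy_pte_range()-style sequence, the ptep_get() issued after the
nested set_pte() would trip the counter, giving the kind of "troubled location"
report being suggested, without needing to audit every call site by hand.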




end of thread, other threads:[~2025-04-14 14:11 UTC | newest]

Thread overview: 11+ messages
2025-03-03 14:15 [PATCH v2 0/5] Fix lazy mmu mode Ryan Roberts
2025-03-03 14:15 ` [PATCH v2 1/5] mm: Fix lazy mmu docs and usage Ryan Roberts
2025-03-03 14:15 ` [PATCH v2 2/5] fs/proc/task_mmu: Reduce scope of lazy mmu region Ryan Roberts
2025-03-03 14:15 ` [PATCH v2 3/5] sparc/mm: Disable preemption in lazy mmu mode Ryan Roberts
2025-03-03 14:15 ` [PATCH v2 4/5] sparc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes Ryan Roberts
2025-03-03 14:15 ` [PATCH v2 5/5] Revert "x86/xen: allow nesting of same lazy mode" Ryan Roberts
2025-03-03 14:36 ` [PATCH v2 0/5] Fix lazy mmu mode Jürgen Groß
2025-04-10 16:07 ` Alexander Gordeev
2025-04-14 13:22   ` Ryan Roberts
2025-04-14 14:04     ` Alexander Gordeev
2025-04-14 14:11       ` Ryan Roberts
