* [PATCH v5 00/12] Nesting support for lazy MMU mode
@ 2025-11-24 13:22 Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 01/12] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
` (12 more replies)
0 siblings, 13 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
When the lazy MMU mode was introduced eons ago, it wasn't made clear
whether such a sequence was legal:
arch_enter_lazy_mmu_mode()
...
arch_enter_lazy_mmu_mode()
...
arch_leave_lazy_mmu_mode()
...
arch_leave_lazy_mmu_mode()
It seems fair to say that nested calls to
arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
architectures never explicitly supported nesting.
Nesting does in fact occur in certain configurations, and avoiding it
has proved difficult. This series therefore enables lazy_mmu sections to
nest, on all architectures.
Nesting is handled using a counter in task_struct (patch 8), like other
stateless APIs such as pagefault_{disable,enable}(). This is fully
handled in a new generic layer in <linux/pgtable.h>; the arch_* API
remains unchanged. A new pair of calls, lazy_mmu_mode_{pause,resume}(),
is also introduced to allow functions that are called with the lazy MMU
mode enabled to temporarily pause it, regardless of nesting.
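As an illustration, here is a rough sketch of how the generic API can
be used once nesting is supported (outer_walk() and inner_helper() are
made-up names, not functions from this series):

static void inner_helper(void)
{
        lazy_mmu_mode_enable();   /* nested: only bumps the counter */
        /* ... batched PTE updates ... */
        lazy_mmu_mode_disable();  /* nested: flushes, stays in the mode */
}

static void outer_walk(void)
{
        lazy_mmu_mode_enable();   /* outermost: arch_enter() */
        inner_helper();

        lazy_mmu_mode_pause();    /* fully exits the mode, whatever the nesting */
        /* ... updates that must not be batched ... */
        lazy_mmu_mode_resume();   /* re-enters the mode */

        lazy_mmu_mode_disable();  /* outermost: arch_leave() */
}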
An arch now opts in to using the lazy MMU mode by selecting
CONFIG_ARCH_HAS_LAZY_MMU_MODE; this is more appropriate now that we have
a generic API, especially with state conditionally added to task_struct.
---
Background: Ryan Roberts' series from March [1] attempted to prevent
nesting from ever occurring, and mostly succeeded. Unfortunately, a
corner case (DEBUG_PAGEALLOC) may still cause nesting to occur on arm64.
Ryan proposed [2] to address that corner case at the generic level but
this approach received pushback; [3] then attempted to solve the issue
on arm64 only, but it was deemed too fragile.
It feels generally difficult to guarantee that lazy_mmu sections don't
nest, because callers of various standard mm functions do not know if
the function uses lazy_mmu itself.
The overall approach in v3/v4 is very close to what David Hildenbrand
proposed on v2 [4].
Unlike in v1/v2, no special provision is made for architectures to
save/restore extra state when entering/leaving the mode. Based on the
discussions so far, this does not seem to be required - an arch can
store any relevant state in thread_struct during arch_enter() and
restore it in arch_leave(). Nesting is not a concern as these functions
are only called at the top level, not in nested sections.
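Purely as an illustration (all arch-side names below are invented, not
taken from this series), an arch could do something along these lines:

static inline void arch_enter_lazy_mmu_mode(void)
{
        /* Save any per-task state needed while the mode is active */
        current->thread.lazy_mmu_saved_state = read_arch_batching_state();
        start_arch_batching();
}

static inline void arch_leave_lazy_mmu_mode(void)
{
        flush_arch_batch();
        /* Restore the state saved in arch_enter() - no nesting to worry about */
        write_arch_batching_state(current->thread.lazy_mmu_saved_state);
}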
The introduction of a generic layer, and tracking of the lazy MMU state
in task_struct, also makes it possible to streamline the arch callbacks -
this series removes 67 lines from arch/.
Patch overview:
* Patch 1: cleanup - avoids having to deal with the powerpc
context-switching code
* Patch 2-4: prepare arch_flush_lazy_mmu_mode() to be called from the
generic layer (patch 8)
* Patch 5-6: new API + CONFIG_ARCH_HAS_LAZY_MMU_MODE
* Patch 7: ensure correctness in interrupt context
* Patch 8: nesting support
* Patch 9-12: replace arch-specific tracking of lazy MMU mode with
generic API
This series has been tested by running the mm kselftests on arm64 with
DEBUG_VM, DEBUG_PAGEALLOC, KFENCE and KASAN. It was also build-tested on
other architectures (with and without XEN_PV on x86).
- Kevin
[1] https://lore.kernel.org/all/20250303141542.3371656-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20250530140446.2387131-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20250606135654.178300-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/ef343405-c394-4763-a79f-21381f217b6c@redhat.com/
---
Changelog
v4..v5:
- Rebased on mm-unstable
- Patch 3: added missing radix_enabled() check in arch_flush()
[Ritesh Harjani]
- Patch 6: declare arch_flush_lazy_mmu_mode() as static inline on x86
[Ryan Roberts]
- Patch 7 (formerly 12): moved before patch 8 to ensure correctness in
interrupt context [Ryan]. The diffs in in_lazy_mmu_mode() and
queue_pte_barriers() are moved to patches 8 and 9 respectively.
- Patch 8:
* Removed all restrictions regarding lazy_mmu_mode_{pause,resume}().
They may now be called even when lazy MMU isn't enabled, and
any call to lazy_mmu_mode_* may be made while paused (such calls
will be ignored). [David, Ryan]
* lazy_mmu_state.{nesting_level,active} are replaced with
{enable_count,pause_count} to track arbitrary nesting of both
enable/disable and pause/resume [Ryan]
* Added __task_lazy_mmu_mode_active() for use in patch 12 [David]
* Added documentation for all the functions [Ryan]
- Patch 9: keep existing test + set TIF_LAZY_MMU_PENDING instead of
atomic RMW [David, Ryan]
- Patch 12: use __task_lazy_mmu_mode_active() instead of accessing
lazy_mmu_state directly [David]
- Collected R-b/A-b tags
v4: https://lore.kernel.org/all/20251029100909.3381140-1-kevin.brodsky@arm.com/
v3..v4:
- Patch 2: restored ordering of preempt_{disable,enable}() [Dave Hansen]
- Patch 5 onwards: s/ARCH_LAZY_MMU/ARCH_HAS_LAZY_MMU_MODE/ [Mike Rapoport]
- Patch 7: renamed lazy_mmu_state members, removed VM_BUG_ON(),
reordered writes to lazy_mmu_state members [David Hildenbrand]
- Dropped patch 13 as it doesn't seem justified [David H]
- Various improvements to commit messages [David H]
v3: https://lore.kernel.org/all/20251015082727.2395128-1-kevin.brodsky@arm.com/
v2..v3:
- Full rewrite; dropped all Acked-by/Reviewed-by.
- Rebased on v6.18-rc1.
v2: https://lore.kernel.org/all/20250908073931.4159362-1-kevin.brodsky@arm.com/
v1..v2:
- Rebased on mm-unstable.
- Patch 2: handled new calls to enter()/leave(), clarified how the "flush"
pattern (leave() followed by enter()) is handled.
- Patch 5,6: removed unnecessary local variable [Alexander Gordeev's
suggestion].
- Added Mike Rapoport's Acked-by.
v1: https://lore.kernel.org/all/20250904125736.3918646-1-kevin.brodsky@arm.com/
---
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: x86@kernel.org
---
Alexander Gordeev (1):
powerpc/64s: Do not re-activate batched TLB flush
Kevin Brodsky (11):
x86/xen: simplify flush_lazy_mmu()
powerpc/mm: implement arch_flush_lazy_mmu_mode()
sparc/mm: implement arch_flush_lazy_mmu_mode()
mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
mm: introduce generic lazy_mmu helpers
mm: bail out of lazy_mmu_mode_* in interrupt context
mm: enable lazy_mmu sections to nest
arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
powerpc/mm: replace batch->active with in_lazy_mmu_mode()
sparc/mm: replace batch->active with in_lazy_mmu_mode()
x86/xen: use lazy_mmu_state when context-switching
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 41 +----
arch/arm64/include/asm/thread_info.h | 3 +-
arch/arm64/mm/mmu.c | 4 +-
arch/arm64/mm/pageattr.c | 4 +-
.../include/asm/book3s/64/tlbflush-hash.h | 20 ++-
arch/powerpc/include/asm/thread_info.h | 2 -
arch/powerpc/kernel/process.c | 25 ---
arch/powerpc/mm/book3s64/hash_tlb.c | 10 +-
arch/powerpc/mm/book3s64/subpage_prot.c | 4 +-
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/tlbflush_64.h | 5 +-
arch/sparc/mm/tlb.c | 14 +-
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/misc.h | 1 +
arch/x86/boot/startup/sme.c | 1 +
arch/x86/include/asm/paravirt.h | 1 -
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/include/asm/thread_info.h | 4 +-
arch/x86/xen/enlighten_pv.c | 3 +-
arch/x86/xen/mmu_pv.c | 6 +-
fs/proc/task_mmu.c | 4 +-
include/linux/mm_types_task.h | 5 +
include/linux/pgtable.h | 147 +++++++++++++++++-
include/linux/sched.h | 45 ++++++
mm/Kconfig | 3 +
mm/kasan/shadow.c | 8 +-
mm/madvise.c | 18 +--
mm/memory.c | 16 +-
mm/migrate_device.c | 8 +-
mm/mprotect.c | 4 +-
mm/mremap.c | 4 +-
mm/userfaultfd.c | 4 +-
mm/vmalloc.c | 12 +-
mm/vmscan.c | 12 +-
36 files changed, 282 insertions(+), 161 deletions(-)
base-commit: 1f1edd95f9231ba58a1e535b10200cb1eeaf1f67
--
2.51.2
* [PATCH v5 01/12] powerpc/64s: Do not re-activate batched TLB flush
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 02/12] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
` (11 subsequent siblings)
12 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
From: Alexander Gordeev <agordeev@linux.ibm.com>
Since commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
lazy mmu mode") a task can not be preempted while in lazy MMU mode.
Therefore, the batch re-activation code is never called, so remove it.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/powerpc/include/asm/thread_info.h | 2 --
arch/powerpc/kernel/process.c | 25 -------------------------
2 files changed, 27 deletions(-)
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index b0f200aba2b3..97f35f9b1a96 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -154,12 +154,10 @@ void arch_setup_new_exec(void);
/* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
#define TLF_NAPPING 0 /* idle thread enabled NAP mode */
#define TLF_SLEEPING 1 /* suspend code enabled SLEEP mode */
-#define TLF_LAZY_MMU 3 /* tlb_batch is active */
#define TLF_RUNLATCH 4 /* Is the runlatch enabled? */
#define _TLF_NAPPING (1 << TLF_NAPPING)
#define _TLF_SLEEPING (1 << TLF_SLEEPING)
-#define _TLF_LAZY_MMU (1 << TLF_LAZY_MMU)
#define _TLF_RUNLATCH (1 << TLF_RUNLATCH)
#ifndef __ASSEMBLER__
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index eb23966ac0a9..9237dcbeee4a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1281,9 +1281,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
{
struct thread_struct *new_thread, *old_thread;
struct task_struct *last;
-#ifdef CONFIG_PPC_64S_HASH_MMU
- struct ppc64_tlb_batch *batch;
-#endif
new_thread = &new->thread;
old_thread = &current->thread;
@@ -1291,14 +1288,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
WARN_ON(!irqs_disabled());
#ifdef CONFIG_PPC_64S_HASH_MMU
- batch = this_cpu_ptr(&ppc64_tlb_batch);
- if (batch->active) {
- current_thread_info()->local_flags |= _TLF_LAZY_MMU;
- if (batch->index)
- __flush_tlb_pending(batch);
- batch->active = 0;
- }
-
/*
* On POWER9 the copy-paste buffer can only paste into
* foreign real addresses, so unprivileged processes can not
@@ -1369,20 +1358,6 @@ struct task_struct *__switch_to(struct task_struct *prev,
*/
#ifdef CONFIG_PPC_BOOK3S_64
-#ifdef CONFIG_PPC_64S_HASH_MMU
- /*
- * This applies to a process that was context switched while inside
- * arch_enter_lazy_mmu_mode(), to re-activate the batch that was
- * deactivated above, before _switch(). This will never be the case
- * for new tasks.
- */
- if (current_thread_info()->local_flags & _TLF_LAZY_MMU) {
- current_thread_info()->local_flags &= ~_TLF_LAZY_MMU;
- batch = this_cpu_ptr(&ppc64_tlb_batch);
- batch->active = 1;
- }
-#endif
-
/*
* Math facilities are masked out of the child MSR in copy_thread.
* A new task does not need to restore_math because it will
--
2.51.2
* [PATCH v5 02/12] x86/xen: simplify flush_lazy_mmu()
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 01/12] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-12-04 3:36 ` Anshuman Khandual
2025-11-24 13:22 ` [PATCH v5 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
` (10 subsequent siblings)
12 siblings, 1 reply; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
arch_flush_lazy_mmu_mode() is called when outstanding batched
pgtable operations must be completed immediately. There should
however be no need to leave and re-enter lazy MMU completely. The
only part of that sequence that we really need is xen_mc_flush();
call it directly.
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/x86/xen/mmu_pv.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2a4a8deaf612..7a35c3393df4 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2139,10 +2139,8 @@ static void xen_flush_lazy_mmu(void)
{
preempt_disable();
- if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
- arch_leave_lazy_mmu_mode();
- arch_enter_lazy_mmu_mode();
- }
+ if (xen_get_lazy_mode() == XEN_LAZY_MMU)
+ xen_mc_flush();
preempt_enable();
}
--
2.51.2
* [PATCH v5 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode()
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 01/12] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 02/12] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 04/12] sparc/mm: " Kevin Brodsky
` (9 subsequent siblings)
12 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
Upcoming changes to the lazy_mmu API will cause
arch_flush_lazy_mmu_mode() to be called when leaving a nested
lazy_mmu section.
Move the relevant logic from arch_leave_lazy_mmu_mode() to
arch_flush_lazy_mmu_mode() and have the former call the latter. The
radix_enabled() check is required in both as
arch_flush_lazy_mmu_mode() will be called directly from the generic
layer in a subsequent patch.
Note: the additional this_cpu_ptr() and radix_enabled() calls on the
arch_leave_lazy_mmu_mode() path will be removed in a subsequent
patch.
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
.../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 146287d9580f..2d45f57df169 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -41,7 +41,7 @@ static inline void arch_enter_lazy_mmu_mode(void)
batch->active = 1;
}
-static inline void arch_leave_lazy_mmu_mode(void)
+static inline void arch_flush_lazy_mmu_mode(void)
{
struct ppc64_tlb_batch *batch;
@@ -51,12 +51,21 @@ static inline void arch_leave_lazy_mmu_mode(void)
if (batch->index)
__flush_tlb_pending(batch);
+}
+
+static inline void arch_leave_lazy_mmu_mode(void)
+{
+ struct ppc64_tlb_batch *batch;
+
+ if (radix_enabled())
+ return;
+ batch = this_cpu_ptr(&ppc64_tlb_batch);
+
+ arch_flush_lazy_mmu_mode();
batch->active = 0;
preempt_enable();
}
-#define arch_flush_lazy_mmu_mode() do {} while (0)
-
extern void hash__tlbiel_all(unsigned int action);
extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize,
--
2.51.2
* [PATCH v5 04/12] sparc/mm: implement arch_flush_lazy_mmu_mode()
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (2 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Kevin Brodsky
` (8 subsequent siblings)
12 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
Upcoming changes to the lazy_mmu API will cause
arch_flush_lazy_mmu_mode() to be called when leaving a nested
lazy_mmu section.
Move the relevant logic from arch_leave_lazy_mmu_mode() to
arch_flush_lazy_mmu_mode() and have the former call the latter.
Note: the additional this_cpu_ptr() call on the
arch_leave_lazy_mmu_mode() path will be removed in a subsequent
patch.
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/sparc/include/asm/tlbflush_64.h | 2 +-
arch/sparc/mm/tlb.c | 9 ++++++++-
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 8b8cdaa69272..925bb5d7a4e1 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -43,8 +43,8 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
void flush_tlb_pending(void);
void arch_enter_lazy_mmu_mode(void);
+void arch_flush_lazy_mmu_mode(void);
void arch_leave_lazy_mmu_mode(void);
-#define arch_flush_lazy_mmu_mode() do {} while (0)
/* Local cpu only. */
void __flush_tlb_all(void);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index a35ddcca5e76..7b5dfcdb1243 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -59,12 +59,19 @@ void arch_enter_lazy_mmu_mode(void)
tb->active = 1;
}
-void arch_leave_lazy_mmu_mode(void)
+void arch_flush_lazy_mmu_mode(void)
{
struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
if (tb->tlb_nr)
flush_tlb_pending();
+}
+
+void arch_leave_lazy_mmu_mode(void)
+{
+ struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
+
+ arch_flush_lazy_mmu_mode();
tb->active = 0;
preempt_enable();
}
--
2.51.2
* [PATCH v5 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (3 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 04/12] sparc/mm: " Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-12-01 6:21 ` Anshuman Khandual
2025-11-24 13:22 ` [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers Kevin Brodsky
` (7 subsequent siblings)
12 siblings, 1 reply; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
Architectures currently opt in for implementing lazy_mmu helpers by
defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.
In preparation for introducing a generic lazy_mmu layer that will
require storage in task_struct, let's switch to a cleaner approach:
instead of defining a macro, select a CONFIG option.
This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each
arch select it when it implements lazy_mmu helpers.
__HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
relies on the new CONFIG instead.
On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
selected. This creates some complications in arch/x86/boot/, because
a few files manually undefine PARAVIRT* options. As a result
<asm/paravirt.h> does not define the lazy_mmu helpers, but this
breaks the build as <linux/pgtable.h> only defines them if
!CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean
way out of this - let's just undefine that new CONFIG too.
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/pgtable.h | 1 -
arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 --
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/sparc/Kconfig | 1 +
arch/sparc/include/asm/tlbflush_64.h | 2 --
arch/x86/Kconfig | 1 +
arch/x86/boot/compressed/misc.h | 1 +
arch/x86/boot/startup/sme.c | 1 +
arch/x86/include/asm/paravirt.h | 1 -
include/linux/pgtable.h | 2 +-
mm/Kconfig | 3 +++
12 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6663ffd23f25..74be32f5f446 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -34,6 +34,7 @@ config ARM64
select ARCH_HAS_KCOV
select ARCH_HAS_KERNEL_FPU_SUPPORT if KERNEL_MODE_NEON
select ARCH_HAS_KEEPINITRD
+ select ARCH_HAS_LAZY_MMU_MODE
select ARCH_HAS_MEMBARRIER_SYNC_CORE
select ARCH_HAS_MEM_ENCRYPT
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0944e296dd4a..54f8d6bb6f22 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -80,7 +80,6 @@ static inline void queue_pte_barriers(void)
}
}
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
static inline void arch_enter_lazy_mmu_mode(void)
{
/*
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 2d45f57df169..565c1b7c3eae 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -24,8 +24,6 @@ DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch);
extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-
static inline void arch_enter_lazy_mmu_mode(void)
{
struct ppc64_tlb_batch *batch;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 4c321a8ea896..f399917c17bd 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -93,6 +93,7 @@ config PPC_BOOK3S_64
select IRQ_WORK
select PPC_64S_HASH_MMU if !PPC_RADIX_MMU
select KASAN_VMALLOC if KASAN
+ select ARCH_HAS_LAZY_MMU_MODE
config PPC_BOOK3E_64
bool "Embedded processors"
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index a630d373e645..2bad14744ca4 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -112,6 +112,7 @@ config SPARC64
select NEED_PER_CPU_PAGE_FIRST_CHUNK
select ARCH_SUPPORTS_SCHED_SMT if SMP
select ARCH_SUPPORTS_SCHED_MC if SMP
+ select ARCH_HAS_LAZY_MMU_MODE
config ARCH_PROC_KCORE_TEXT
def_bool y
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 925bb5d7a4e1..4e1036728e2f 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -39,8 +39,6 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
void flush_tlb_kernel_range(unsigned long start, unsigned long end);
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-
void flush_tlb_pending(void);
void arch_enter_lazy_mmu_mode(void);
void arch_flush_lazy_mmu_mode(void);
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a3700766a8c0..db769c4addf9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -805,6 +805,7 @@ config PARAVIRT
config PARAVIRT_XXL
bool
depends on X86_64
+ select ARCH_HAS_LAZY_MMU_MODE
config PARAVIRT_DEBUG
bool "paravirt-ops debugging"
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
index db1048621ea2..cdd7f692d9ee 100644
--- a/arch/x86/boot/compressed/misc.h
+++ b/arch/x86/boot/compressed/misc.h
@@ -11,6 +11,7 @@
#undef CONFIG_PARAVIRT
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS
+#undef CONFIG_ARCH_HAS_LAZY_MMU_MODE
#undef CONFIG_KASAN
#undef CONFIG_KASAN_GENERIC
diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
index e7ea65f3f1d6..b76a7c95dfe1 100644
--- a/arch/x86/boot/startup/sme.c
+++ b/arch/x86/boot/startup/sme.c
@@ -24,6 +24,7 @@
#undef CONFIG_PARAVIRT
#undef CONFIG_PARAVIRT_XXL
#undef CONFIG_PARAVIRT_SPINLOCKS
+#undef CONFIG_ARCH_HAS_LAZY_MMU_MODE
/*
* This code runs before CPU feature bits are set. By default, the
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index b5e59a7ba0d0..13f9cd31c8f8 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -526,7 +526,6 @@ static inline void arch_end_context_switch(struct task_struct *next)
PVOP_VCALL1(cpu.end_context_switch, next);
}
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
static inline void arch_enter_lazy_mmu_mode(void)
{
PVOP_VCALL0(mmu.lazy_mode.enter);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b13b6f42be3c..de7d2c7e63eb 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -231,7 +231,7 @@ static inline int pmd_dirty(pmd_t pmd)
* held, but for kernel PTE updates, no lock is held). Nesting is not permitted
* and the mode cannot be used in interrupt context.
*/
-#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
+#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
static inline void arch_enter_lazy_mmu_mode(void) {}
static inline void arch_leave_lazy_mmu_mode(void) {}
static inline void arch_flush_lazy_mmu_mode(void) {}
diff --git a/mm/Kconfig b/mm/Kconfig
index bd0ea5454af8..a7486fae0cd3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1464,6 +1464,9 @@ config PT_RECLAIM
config FIND_NORMAL_PAGE
def_bool n
+config ARCH_HAS_LAZY_MMU_MODE
+ bool
+
source "mm/damon/Kconfig"
endmenu
--
2.51.2
* [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (4 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-28 13:50 ` Alexander Gordeev
2025-12-04 4:17 ` Anshuman Khandual
2025-11-24 13:22 ` [PATCH v5 07/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
` (6 subsequent siblings)
12 siblings, 2 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
The implementation of the lazy MMU mode is currently entirely
arch-specific; core code directly calls arch helpers:
arch_{enter,leave}_lazy_mmu_mode().
We are about to introduce support for nested lazy MMU sections.
As things stand we'd have to duplicate that logic in every arch
implementing lazy_mmu - adding to a fair amount of logic
already duplicated across lazy_mmu implementations.
This patch therefore introduces a new generic layer that calls the
existing arch_* helpers. Two pairs of calls are introduced:
* lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
This is the standard case where the mode is enabled for a given
block of code by surrounding it with enable() and disable()
calls.
* lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
This is for situations where the mode needs to be temporarily paused
by calling pause() and later re-enabled by calling resume() (e.g. to
prevent any batching from occurring in a critical section).
The documentation in <linux/pgtable.h> will be updated in a
subsequent patch.
No functional change should be introduced at this stage.
The implementation of enable()/resume() and disable()/pause() is
currently identical, but nesting support will change that.
Most of the call sites have been updated using the following
Coccinelle script:
@@
@@
{
...
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
...
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
...
}
@@
@@
{
...
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_pause();
...
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_resume();
...
}
A couple of notes regarding x86:
* Xen is currently the only case where explicit handling is required
for lazy MMU when context-switching. This is purely an
implementation detail and using the generic lazy_mmu_mode_*
functions would cause trouble when nesting support is introduced,
because the generic functions must be called from the current task.
For that reason we still use arch_leave() and arch_enter() there.
* x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
places, but only defines it if PARAVIRT_XXL is selected, and we
are removing the fallback in <linux/pgtable.h>. Add a new fallback
definition to <asm/pgtable.h> to keep things building.
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/mm/mmu.c | 4 ++--
arch/arm64/mm/pageattr.c | 4 ++--
arch/powerpc/mm/book3s64/hash_tlb.c | 8 +++----
arch/powerpc/mm/book3s64/subpage_prot.c | 4 ++--
arch/x86/include/asm/pgtable.h | 1 +
fs/proc/task_mmu.c | 4 ++--
include/linux/pgtable.h | 29 +++++++++++++++++++++----
mm/kasan/shadow.c | 8 +++----
mm/madvise.c | 18 +++++++--------
mm/memory.c | 16 +++++++-------
mm/migrate_device.c | 8 +++----
mm/mprotect.c | 4 ++--
mm/mremap.c | 4 ++--
mm/userfaultfd.c | 4 ++--
mm/vmalloc.c | 12 +++++-----
mm/vmscan.c | 12 +++++-----
16 files changed, 81 insertions(+), 59 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 94e29e3574ff..ce66ae77abaa 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -729,7 +729,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
return -EINVAL;
mutex_lock(&pgtable_split_lock);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
/*
* The split_kernel_leaf_mapping_locked() may sleep, it is not a
@@ -751,7 +751,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
ret = split_kernel_leaf_mapping_locked(end);
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
mutex_unlock(&pgtable_split_lock);
return ret;
}
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 5135f2d66958..e4059f13c4ed 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -110,7 +110,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
if (WARN_ON_ONCE(ret))
return ret;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
/*
* The caller must ensure that the range we are operating on does not
@@ -119,7 +119,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
*/
ret = walk_kernel_page_table_range_lockless(start, start + size,
&pageattr_ops, NULL, &data);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
return ret;
}
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 21fcad97ae80..787f7a0e27f0 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -205,7 +205,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
* way to do things but is fine for our needs here.
*/
local_irq_save(flags);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; start < end; start += PAGE_SIZE) {
pte_t *ptep = find_init_mm_pte(start, &hugepage_shift);
unsigned long pte;
@@ -217,7 +217,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
continue;
hpte_need_flush(&init_mm, start, ptep, pte, hugepage_shift);
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
local_irq_restore(flags);
}
@@ -237,7 +237,7 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
* way to do things but is fine for our needs here.
*/
local_irq_save(flags);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
start_pte = pte_offset_map(pmd, addr);
if (!start_pte)
goto out;
@@ -249,6 +249,6 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
}
pte_unmap(start_pte);
out:
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
local_irq_restore(flags);
}
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index ec98e526167e..07c47673bba2 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -73,13 +73,13 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; npages > 0; --npages) {
pte_update(mm, addr, pte, 0, 0, 0);
addr += PAGE_SIZE;
++pte;
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(pte - 1, ptl);
}
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e33df3da6980..2842fa1f7a2c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -118,6 +118,7 @@ extern pmdval_t early_pmd_flags;
#define __pte(x) native_make_pte(x)
#define arch_end_context_switch(prev) do {} while(0)
+static inline void arch_flush_lazy_mmu_mode(void) {}
#endif /* CONFIG_PARAVIRT_XXL */
static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d00ac179d973..ee1778adcc20 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2737,7 +2737,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
return 0;
}
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
/* Fast path for performing exclusive WP */
@@ -2807,7 +2807,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
if (flush_end)
flush_tlb_range(vma, start, addr);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
cond_resched();
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index de7d2c7e63eb..c121358dba15 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -231,10 +231,31 @@ static inline int pmd_dirty(pmd_t pmd)
* held, but for kernel PTE updates, no lock is held). Nesting is not permitted
* and the mode cannot be used in interrupt context.
*/
-#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
-static inline void arch_enter_lazy_mmu_mode(void) {}
-static inline void arch_leave_lazy_mmu_mode(void) {}
-static inline void arch_flush_lazy_mmu_mode(void) {}
+#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+static inline void lazy_mmu_mode_enable(void)
+{
+ arch_enter_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_disable(void)
+{
+ arch_leave_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_pause(void)
+{
+ arch_leave_lazy_mmu_mode();
+}
+
+static inline void lazy_mmu_mode_resume(void)
+{
+ arch_enter_lazy_mmu_mode();
+}
+#else
+static inline void lazy_mmu_mode_enable(void) {}
+static inline void lazy_mmu_mode_disable(void) {}
+static inline void lazy_mmu_mode_pause(void) {}
+static inline void lazy_mmu_mode_resume(void) {}
#endif
#ifndef pte_batch_hint
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 29a751a8a08d..c1433d5cc5db 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
pte_t pte;
int index;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_pause();
index = PFN_DOWN(addr - data->start);
page = data->pages[index];
@@ -319,7 +319,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
}
spin_unlock(&init_mm.page_table_lock);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_resume();
return 0;
}
@@ -471,7 +471,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
pte_t pte;
int none;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_pause();
spin_lock(&init_mm.page_table_lock);
pte = ptep_get(ptep);
@@ -483,7 +483,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
if (likely(!none))
__free_page(pfn_to_page(pte_pfn(pte)));
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_resume();
return 0;
}
diff --git a/mm/madvise.c b/mm/madvise.c
index b617b1be0f53..6bf7009fa5ce 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -453,7 +453,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!start_pte)
return 0;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
nr = 1;
ptent = ptep_get(pte);
@@ -461,7 +461,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (++batch_count == SWAP_CLUSTER_MAX) {
batch_count = 0;
if (need_resched()) {
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
cond_resched();
goto restart;
@@ -497,7 +497,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!folio_trylock(folio))
continue;
folio_get(folio);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
start_pte = NULL;
err = split_folio(folio);
@@ -508,7 +508,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!start_pte)
break;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
if (!err)
nr = 0;
continue;
@@ -556,7 +556,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
}
if (start_pte) {
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
}
if (pageout)
@@ -675,7 +675,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!start_pte)
return 0;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
nr = 1;
ptent = ptep_get(pte);
@@ -724,7 +724,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!folio_trylock(folio))
continue;
folio_get(folio);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
start_pte = NULL;
err = split_folio(folio);
@@ -735,7 +735,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!start_pte)
break;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
if (!err)
nr = 0;
continue;
@@ -775,7 +775,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (nr_swap)
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
if (start_pte) {
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(start_pte, ptl);
}
cond_resched();
diff --git a/mm/memory.c b/mm/memory.c
index 6675e87eb7dd..c0c29a3b0bcc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1256,7 +1256,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
orig_src_pte = src_pte;
orig_dst_pte = dst_pte;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
nr = 1;
@@ -1325,7 +1325,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
} while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(orig_src_pte, src_ptl);
add_mm_rss_vec(dst_mm, rss);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
@@ -1842,7 +1842,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
return addr;
flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
bool any_skipped = false;
@@ -1874,7 +1874,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
add_mm_rss_vec(mm, rss);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
/* Do the actual TLB flush before dropping ptl */
if (force_flush) {
@@ -2813,7 +2813,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return -ENOMEM;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
BUG_ON(!pte_none(ptep_get(pte)));
if (!pfn_modify_allowed(pfn, prot)) {
@@ -2823,7 +2823,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
pfn++;
} while (pte++, addr += PAGE_SIZE, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(mapped_pte, ptl);
return err;
}
@@ -3174,7 +3174,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
return -EINVAL;
}
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
if (fn) {
do {
@@ -3187,7 +3187,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
}
*mask |= PGTBL_PTE_MODIFIED;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
if (mm != &init_mm)
pte_unmap_unlock(mapped_pte, ptl);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 23379663b1e1..0346c2d7819f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -271,7 +271,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
if (!ptep)
goto again;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
ptep += (addr - start) / PAGE_SIZE;
for (; addr < end; addr += PAGE_SIZE, ptep++) {
@@ -313,7 +313,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (folio_test_large(folio)) {
int ret;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(ptep, ptl);
ret = migrate_vma_split_folio(folio,
migrate->fault_page);
@@ -356,7 +356,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (folio && folio_test_large(folio)) {
int ret;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(ptep, ptl);
ret = migrate_vma_split_folio(folio,
migrate->fault_page);
@@ -485,7 +485,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (unmapped)
flush_tlb_range(walk->vma, start, end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(ptep - 1, ptl);
return 0;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 283889e4f1ce..c0571445bef7 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -233,7 +233,7 @@ static long change_pte_range(struct mmu_gather *tlb,
is_private_single_threaded = vma_is_single_threaded_private(vma);
flush_tlb_batched_pending(vma->vm_mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
nr_ptes = 1;
oldpte = ptep_get(pte);
@@ -379,7 +379,7 @@ static long change_pte_range(struct mmu_gather *tlb,
}
}
} while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(pte - 1, ptl);
return pages;
diff --git a/mm/mremap.c b/mm/mremap.c
index 672264807db6..8275b9772ec1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -260,7 +260,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
flush_tlb_batched_pending(vma->vm_mm);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
@@ -305,7 +305,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
}
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
if (force_flush)
flush_tlb_range(vma, old_end - len, old_end);
if (new_ptl != old_ptl)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e6dfd5f28acd..b11f81095fa5 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1103,7 +1103,7 @@ static long move_present_ptes(struct mm_struct *mm,
/* It's safe to drop the reference now as the page-table is holding one. */
folio_put(*first_src_folio);
*first_src_folio = NULL;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
while (true) {
orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1140,7 +1140,7 @@ static long move_present_ptes(struct mm_struct *mm,
break;
}
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
if (src_addr > src_start)
flush_tlb_range(src_vma, src_start, src_addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ecbac900c35f..1dea299fbb5a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -108,7 +108,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
if (!pte)
return -ENOMEM;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
if (unlikely(!pte_none(ptep_get(pte)))) {
@@ -134,7 +134,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pfn++;
} while (pte += PFN_DOWN(size), addr += size, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
return 0;
}
@@ -366,7 +366,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
unsigned long size = PAGE_SIZE;
pte = pte_offset_kernel(pmd, addr);
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
#ifdef CONFIG_HUGETLB_PAGE
@@ -385,7 +385,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
WARN_ON(!pte_none(ptent) && !pte_present(ptent));
} while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
}
@@ -533,7 +533,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
if (!pte)
return -ENOMEM;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
struct page *page = pages[*nr];
@@ -555,7 +555,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
(*nr)++;
} while (pte++, addr += PAGE_SIZE, addr != end);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
*mask |= PGTBL_PTE_MODIFIED;
return err;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 92980b072121..564c97a9362f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3515,7 +3515,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
return false;
}
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
unsigned long pfn;
@@ -3556,7 +3556,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
goto restart;
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
pte_unmap_unlock(pte, ptl);
return suitable_to_scan(total, young);
@@ -3597,7 +3597,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
if (!spin_trylock(ptl))
goto done;
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
do {
unsigned long pfn;
@@ -3644,7 +3644,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
walk_update_folio(walk, last, gen, dirty);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
spin_unlock(ptl);
done:
*first = -1;
@@ -4243,7 +4243,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
}
}
- arch_enter_lazy_mmu_mode();
+ lazy_mmu_mode_enable();
pte -= (addr - start) / PAGE_SIZE;
@@ -4277,7 +4277,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
walk_update_folio(walk, last, gen, dirty);
- arch_leave_lazy_mmu_mode();
+ lazy_mmu_mode_disable();
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
--
2.51.2
* [PATCH v5 07/12] mm: bail out of lazy_mmu_mode_* in interrupt context
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (5 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 14:11 ` David Hildenbrand (Red Hat)
2025-12-04 4:34 ` Anshuman Khandual
2025-11-24 13:22 ` [PATCH v5 08/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
` (5 subsequent siblings)
12 siblings, 2 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
The lazy MMU mode cannot be used in interrupt context. This is
documented in <linux/pgtable.h>, but isn't consistently handled
across architectures.
arm64 ensures that calls to lazy_mmu_mode_* have no effect in
interrupt context, because such calls do occur in certain
configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier
batching in interrupt contexts"). Other architectures do not check
this situation, most likely because it hasn't occurred so far.
Let's handle this in the new generic lazy_mmu layer, in the same
fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt().
Also remove the arm64 handling that is now redundant.
Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is
disabled while in interrupt (see queue_pte_barriers() and
xen_get_lazy_mode() respectively). This will be handled in the
generic layer in a subsequent patch.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/include/asm/pgtable.h | 9 ---------
include/linux/pgtable.h | 17 +++++++++++++++--
2 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 54f8d6bb6f22..e596899f4029 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -94,26 +94,17 @@ static inline void arch_enter_lazy_mmu_mode(void)
* keeps tracking simple.
*/
- if (in_interrupt())
- return;
-
set_thread_flag(TIF_LAZY_MMU);
}
static inline void arch_flush_lazy_mmu_mode(void)
{
- if (in_interrupt())
- return;
-
if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
emit_pte_barriers();
}
static inline void arch_leave_lazy_mmu_mode(void)
{
- if (in_interrupt())
- return;
-
arch_flush_lazy_mmu_mode();
clear_thread_flag(TIF_LAZY_MMU);
}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c121358dba15..8ff6fdb4b13d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,27 +228,40 @@ static inline int pmd_dirty(pmd_t pmd)
* of the lazy mode. So the implementation must assume preemption may be enabled
* and cpu migration is possible; it must take steps to be robust against this.
* (In practice, for user PTE updates, the appropriate page table lock(s) are
- * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
- * and the mode cannot be used in interrupt context.
+ * held, but for kernel PTE updates, no lock is held). The mode is disabled in
+ * interrupt context and calls to the lazy_mmu API have no effect.
+ * Nesting is not permitted.
*/
#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
static inline void lazy_mmu_mode_enable(void)
{
+ if (in_interrupt())
+ return;
+
arch_enter_lazy_mmu_mode();
}
static inline void lazy_mmu_mode_disable(void)
{
+ if (in_interrupt())
+ return;
+
arch_leave_lazy_mmu_mode();
}
static inline void lazy_mmu_mode_pause(void)
{
+ if (in_interrupt())
+ return;
+
arch_leave_lazy_mmu_mode();
}
static inline void lazy_mmu_mode_resume(void)
{
+ if (in_interrupt())
+ return;
+
arch_enter_lazy_mmu_mode();
}
#else
--
2.51.2
* [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (6 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 07/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 14:09 ` David Hildenbrand (Red Hat)
` (3 more replies)
2025-11-24 13:22 ` [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
` (4 subsequent siblings)
12 siblings, 4 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
Despite recent efforts to prevent lazy_mmu sections from nesting, it
remains difficult to ensure that it never occurs - and in fact it
does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
made nesting tolerable on arm64, but without truly supporting it:
the inner call to leave() disables the batching optimisation before
the outer section ends.
This patch actually enables lazy_mmu sections to nest by tracking
the nesting level in task_struct, in a similar fashion to e.g.
pagefault_{enable,disable}(). This is fully handled by the generic
lazy_mmu helpers that were recently introduced.
lazy_mmu sections were not initially intended to nest, so we need to
clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
This patch takes the following approach:
* The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
* Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
to the arch via arch_{enter,leave} - lazy MMU remains enabled so
the assumption is that these callbacks are not relevant. However,
existing code may rely on a call to disable() to flush any batched
state, regardless of nesting. arch_flush_lazy_mmu_mode() is
therefore called in that situation.
A separate interface was recently introduced to temporarily pause
the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
exits the mode *regardless of the nesting level*, and resume()
restores the mode at the same nesting level.
pause()/resume() are themselves allowed to nest, so we actually
store two nesting levels in task_struct: enable_count and
pause_count. A new helper in_lazy_mmu_mode() is introduced to
determine whether we are currently in lazy MMU mode; this will be
used in subsequent patches to replace the various ways arch's
currently track whether the mode is enabled.
In summary (enable/pause represent the values *after* the call):
lazy_mmu_mode_enable() -> arch_enter() enable=1 pause=0
lazy_mmu_mode_enable() -> ø enable=2 pause=0
lazy_mmu_mode_pause() -> arch_leave() enable=2 pause=1
lazy_mmu_mode_resume() -> arch_enter() enable=2 pause=0
lazy_mmu_mode_disable() -> arch_flush() enable=1 pause=0
lazy_mmu_mode_disable() -> arch_leave() enable=0 pause=0
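To make this concrete, here is a sketch with two hypothetical functions
(outer() and inner() are made up for illustration; the comments show the
value of enable_count after each call):

  static void inner(struct mm_struct *mm)
  {
          lazy_mmu_mode_enable();         /* nested: no arch callback, enable=2 */
          /* ... more page table updates, still batched ... */
          lazy_mmu_mode_disable();        /* nested: arch_flush(), enable=1 */
  }

  static void outer(struct mm_struct *mm, unsigned long addr,
                    pte_t *ptep, pte_t pte)
  {
          lazy_mmu_mode_enable();                 /* arch_enter(), enable=1 */
          set_pte_at(mm, addr, ptep, pte);        /* may be batched by the arch */
          inner(mm);                              /* nests without leaving the mode */
          lazy_mmu_mode_disable();                /* arch_leave(), enable=0 */
  }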
Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
headers included by <linux/pgtable.h> to use it.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/include/asm/pgtable.h | 12 ----
include/linux/mm_types_task.h | 5 ++
include/linux/pgtable.h | 115 +++++++++++++++++++++++++++++--
include/linux/sched.h | 45 ++++++++++++
4 files changed, 158 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e596899f4029..a7d99dee3dc4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
static inline void arch_enter_lazy_mmu_mode(void)
{
- /*
- * lazy_mmu_mode is not supposed to permit nesting. But in practice this
- * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
- * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
- * permissions on the linear map with apply_to_page_range(), which
- * re-enters lazy_mmu_mode. So we tolerate nesting in our
- * implementation. The first call to arch_leave_lazy_mmu_mode() will
- * flush and clear the flag such that the remainder of the work in the
- * outer nest behaves as if outside of lazy mmu mode. This is safe and
- * keeps tracking simple.
- */
-
set_thread_flag(TIF_LAZY_MMU);
}
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index a82aa80c0ba4..11bf319d78ec 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
#endif
};
+struct lazy_mmu_state {
+ u8 enable_count;
+ u8 pause_count;
+};
+
#endif /* _LINUX_MM_TYPES_TASK_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8ff6fdb4b13d..24fdb6f5c2e1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -230,39 +230,140 @@ static inline int pmd_dirty(pmd_t pmd)
* (In practice, for user PTE updates, the appropriate page table lock(s) are
* held, but for kernel PTE updates, no lock is held). The mode is disabled in
* interrupt context and calls to the lazy_mmu API have no effect.
- * Nesting is not permitted.
+ *
+ * The lazy MMU mode is enabled for a given block of code using:
+ *
+ * lazy_mmu_mode_enable();
+ * <code>
+ * lazy_mmu_mode_disable();
+ *
+ * Nesting is permitted: <code> may itself use an enable()/disable() pair.
+ * A nested call to enable() has no functional effect; however disable() causes
+ * any batched architectural state to be flushed regardless of nesting. After a
+ * call to disable(), the caller can therefore rely on all previous page table
+ * modifications to have taken effect, but the lazy MMU mode may still be
+ * enabled.
+ *
+ * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
+ * This can be done using:
+ *
+ * lazy_mmu_mode_pause();
+ * <code>
+ * lazy_mmu_mode_resume();
+ *
+ * pause() ensures that the mode is exited regardless of the nesting level;
+ * resume() re-enters the mode at the same nesting level. Any call to the
+ * lazy_mmu_mode_* API between those two calls has no effect. In particular,
+ * this means that pause()/resume() pairs may nest.
+ *
+ * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
+ * currently enabled.
*/
#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+/**
+ * lazy_mmu_mode_enable() - Enable the lazy MMU mode.
+ *
+ * Enters a new lazy MMU mode section; if the mode was not already enabled,
+ * enables it and calls arch_enter_lazy_mmu_mode().
+ *
+ * Must be paired with a call to lazy_mmu_mode_disable().
+ *
+ * Has no effect if called:
+ * - While paused - see lazy_mmu_mode_pause()
+ * - In interrupt context
+ */
static inline void lazy_mmu_mode_enable(void)
{
- if (in_interrupt())
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ if (in_interrupt() || state->pause_count > 0)
return;
- arch_enter_lazy_mmu_mode();
+ VM_WARN_ON_ONCE(state->enable_count == U8_MAX);
+
+ if (state->enable_count++ == 0)
+ arch_enter_lazy_mmu_mode();
}
+/**
+ * lazy_mmu_mode_disable() - Disable the lazy MMU mode.
+ *
+ * Exits the current lazy MMU mode section. If it is the outermost section,
+ * disables the mode and calls arch_leave_lazy_mmu_mode(). Otherwise (nested
+ * section), calls arch_flush_lazy_mmu_mode().
+ *
+ * Must match a call to lazy_mmu_mode_enable().
+ *
+ * Has no effect if called:
+ * - While paused - see lazy_mmu_mode_pause()
+ * - In interrupt context
+ */
static inline void lazy_mmu_mode_disable(void)
{
- if (in_interrupt())
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ if (in_interrupt() || state->pause_count > 0)
return;
- arch_leave_lazy_mmu_mode();
+ VM_WARN_ON_ONCE(state->enable_count == 0);
+
+ if (--state->enable_count == 0)
+ arch_leave_lazy_mmu_mode();
+ else /* Exiting a nested section */
+ arch_flush_lazy_mmu_mode();
+
}
+/**
+ * lazy_mmu_mode_pause() - Pause the lazy MMU mode.
+ *
+ * Pauses the lazy MMU mode; if it is currently active, disables it and calls
+ * arch_leave_lazy_mmu_mode().
+ *
+ * Must be paired with a call to lazy_mmu_mode_resume(). Calls to the
+ * lazy_mmu_mode_* API have no effect until the matching resume() call.
+ *
+ * Has no effect if called:
+ * - While paused (inside another pause()/resume() pair)
+ * - In interrupt context
+ */
static inline void lazy_mmu_mode_pause(void)
{
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
if (in_interrupt())
return;
- arch_leave_lazy_mmu_mode();
+ VM_WARN_ON_ONCE(state->pause_count == U8_MAX);
+
+ if (state->pause_count++ == 0 && state->enable_count > 0)
+ arch_leave_lazy_mmu_mode();
}
+/**
+ * lazy_mmu_mode_pause() - Resume the lazy MMU mode.
+ *
+ * Resumes the lazy MMU mode; if it was active at the point where the matching
+ * call to lazy_mmu_mode_pause() was made, re-enables it and calls
+ * arch_enter_lazy_mmu_mode().
+ *
+ * Must match a call to lazy_mmu_mode_pause().
+ *
+ * Has no effect if called:
+ * - While paused (inside another pause()/resume() pair)
+ * - In interrupt context
+ */
static inline void lazy_mmu_mode_resume(void)
{
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
if (in_interrupt())
return;
- arch_enter_lazy_mmu_mode();
+ VM_WARN_ON_ONCE(state->pause_count == 0);
+
+ if (--state->pause_count == 0 && state->enable_count > 0)
+ arch_enter_lazy_mmu_mode();
}
#else
static inline void lazy_mmu_mode_enable(void) {}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878de25c..847e242376db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1441,6 +1441,10 @@ struct task_struct {
struct page_frag task_frag;
+#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+ struct lazy_mmu_state lazy_mmu_state;
+#endif
+
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
#endif
@@ -1724,6 +1728,47 @@ static inline char task_state_to_char(struct task_struct *tsk)
return task_index_to_char(task_state_index(tsk));
}
+#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
+/**
+ * __task_lazy_mmu_mode_active() - Test the lazy MMU mode state for a task.
+ * @tsk: The task to check.
+ *
+ * Test whether @tsk has its lazy MMU mode state set to active (i.e. enabled
+ * and not paused).
+ *
+ * This function only considers the state saved in task_struct; to test whether
+ * current actually is in lazy MMU mode, in_lazy_mmu_mode() should be used
+ * instead.
+ *
+ * This function is intended for architectures that implement the lazy MMU
+ * mode; it must not be called from generic code.
+ */
+static inline bool __task_lazy_mmu_mode_active(struct task_struct *tsk)
+{
+ struct lazy_mmu_state *state = &tsk->lazy_mmu_state;
+
+ return state->enable_count > 0 && state->pause_count == 0;
+}
+
+/**
+ * in_lazy_mmu_mode() - Test whether we are currently in lazy MMU mode.
+ *
+ * Test whether the current context is in lazy MMU mode. This is true if both:
+ * 1. We are not in interrupt context
+ * 2. Lazy MMU mode is active for the current task
+ *
+ * This function is intended for architectures that implement the lazy MMU
+ * mode; it must not be called from generic code.
+ */
+static inline bool in_lazy_mmu_mode(void)
+{
+ if (in_interrupt())
+ return false;
+
+ return __task_lazy_mmu_mode_active(current);
+}
+#endif
+
extern struct pid *cad_pid;
/*
--
2.51.2
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (7 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 08/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 14:10 ` David Hildenbrand (Red Hat)
2025-12-04 6:52 ` Anshuman Khandual
2025-11-24 13:22 ` [PATCH v5 10/12] powerpc/mm: replace batch->active " Kevin Brodsky
` (3 subsequent siblings)
12 siblings, 2 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. As a result we no longer need a TIF flag for that purpose -
let's use the new in_lazy_mmu_mode() helper instead.
The explicit check for in_interrupt() is no longer necessary either
as in_lazy_mmu_mode() always returns false in interrupt context.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/arm64/include/asm/pgtable.h | 19 +++----------------
arch/arm64/include/asm/thread_info.h | 3 +--
2 files changed, 4 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index a7d99dee3dc4..dd7ed653a20d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -62,28 +62,16 @@ static inline void emit_pte_barriers(void)
static inline void queue_pte_barriers(void)
{
- unsigned long flags;
-
- if (in_interrupt()) {
- emit_pte_barriers();
- return;
- }
-
- flags = read_thread_flags();
-
- if (flags & BIT(TIF_LAZY_MMU)) {
+ if (in_lazy_mmu_mode()) {
/* Avoid the atomic op if already set. */
- if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
+ if (!test_thread_flag(TIF_LAZY_MMU_PENDING))
set_thread_flag(TIF_LAZY_MMU_PENDING);
} else {
emit_pte_barriers();
}
}
-static inline void arch_enter_lazy_mmu_mode(void)
-{
- set_thread_flag(TIF_LAZY_MMU);
-}
+static inline void arch_enter_lazy_mmu_mode(void) {}
static inline void arch_flush_lazy_mmu_mode(void)
{
@@ -94,7 +82,6 @@ static inline void arch_flush_lazy_mmu_mode(void)
static inline void arch_leave_lazy_mmu_mode(void)
{
arch_flush_lazy_mmu_mode();
- clear_thread_flag(TIF_LAZY_MMU);
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
index f241b8601ebd..4ff8da0767d9 100644
--- a/arch/arm64/include/asm/thread_info.h
+++ b/arch/arm64/include/asm/thread_info.h
@@ -84,8 +84,7 @@ void arch_setup_new_exec(void);
#define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
#define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
#define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
-#define TIF_LAZY_MMU 31 /* Task in lazy mmu mode */
-#define TIF_LAZY_MMU_PENDING 32 /* Ops pending for lazy mmu mode exit */
+#define TIF_LAZY_MMU_PENDING 31 /* Ops pending for lazy mmu mode exit */
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
--
2.51.2
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 10/12] powerpc/mm: replace batch->active with in_lazy_mmu_mode()
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (8 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 11/12] sparc/mm: " Kevin Brodsky
` (2 subsequent siblings)
12 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
A per-CPU batch struct is activated when entering lazy MMU mode; its
lifetime is the same as the lazy MMU section (it is deactivated when
leaving the mode). Preemption is disabled in that interval to ensure
that the per-CPU reference remains valid.
The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. We can therefore use the generic helper in_lazy_mmu_mode()
to tell whether a batch struct is active instead of tracking it
explicitly.
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 9 ---------
arch/powerpc/mm/book3s64/hash_tlb.c | 2 +-
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 565c1b7c3eae..6cc9abcd7b3d 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -12,7 +12,6 @@
#define PPC64_TLB_BATCH_NR 192
struct ppc64_tlb_batch {
- int active;
unsigned long index;
struct mm_struct *mm;
real_pte_t pte[PPC64_TLB_BATCH_NR];
@@ -26,8 +25,6 @@ extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
static inline void arch_enter_lazy_mmu_mode(void)
{
- struct ppc64_tlb_batch *batch;
-
if (radix_enabled())
return;
/*
@@ -35,8 +32,6 @@ static inline void arch_enter_lazy_mmu_mode(void)
* operating on kernel page tables.
*/
preempt_disable();
- batch = this_cpu_ptr(&ppc64_tlb_batch);
- batch->active = 1;
}
static inline void arch_flush_lazy_mmu_mode(void)
@@ -53,14 +48,10 @@ static inline void arch_flush_lazy_mmu_mode(void)
static inline void arch_leave_lazy_mmu_mode(void)
{
- struct ppc64_tlb_batch *batch;
-
if (radix_enabled())
return;
- batch = this_cpu_ptr(&ppc64_tlb_batch);
arch_flush_lazy_mmu_mode();
- batch->active = 0;
preempt_enable();
}
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 787f7a0e27f0..72b83f582b6d 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -100,7 +100,7 @@ void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
* Check if we have an active batch on this CPU. If not, just
* flush now and return.
*/
- if (!batch->active) {
+ if (!in_lazy_mmu_mode()) {
flush_hash_page(vpn, rpte, psize, ssize, mm_is_thread_local(mm));
put_cpu_var(ppc64_tlb_batch);
return;
--
2.51.2
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 11/12] sparc/mm: replace batch->active with in_lazy_mmu_mode()
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (9 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 10/12] powerpc/mm: replace batch->active " Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 12/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
2025-12-03 16:08 ` [PATCH v5 00/12] Nesting support for lazy MMU mode Venkat
12 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86,
David Hildenbrand (Red Hat)
A per-CPU batch struct is activated when entering lazy MMU mode; its
lifetime is the same as the lazy MMU section (it is deactivated when
leaving the mode). Preemption is disabled in that interval to ensure
that the per-CPU reference remains valid.
The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode. We can therefore use the generic helper in_lazy_mmu_mode()
to tell whether a batch struct is active instead of tracking it
explicitly.
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/sparc/include/asm/tlbflush_64.h | 1 -
arch/sparc/mm/tlb.c | 9 +--------
2 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
index 4e1036728e2f..6133306ba59a 100644
--- a/arch/sparc/include/asm/tlbflush_64.h
+++ b/arch/sparc/include/asm/tlbflush_64.h
@@ -12,7 +12,6 @@ struct tlb_batch {
unsigned int hugepage_shift;
struct mm_struct *mm;
unsigned long tlb_nr;
- unsigned long active;
unsigned long vaddrs[TLB_BATCH_NR];
};
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 7b5dfcdb1243..879e22c86e5c 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -52,11 +52,7 @@ void flush_tlb_pending(void)
void arch_enter_lazy_mmu_mode(void)
{
- struct tlb_batch *tb;
-
preempt_disable();
- tb = this_cpu_ptr(&tlb_batch);
- tb->active = 1;
}
void arch_flush_lazy_mmu_mode(void)
@@ -69,10 +65,7 @@ void arch_flush_lazy_mmu_mode(void)
void arch_leave_lazy_mmu_mode(void)
{
- struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);
-
arch_flush_lazy_mmu_mode();
- tb->active = 0;
preempt_enable();
}
@@ -93,7 +86,7 @@ static void tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr,
nr = 0;
}
- if (!tb->active) {
+ if (!in_lazy_mmu_mode()) {
flush_tsb_user_page(mm, vaddr, hugepage_shift);
global_flush_tlb_page(mm, vaddr);
goto out;
--
2.51.2
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v5 12/12] x86/xen: use lazy_mmu_state when context-switching
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (10 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 11/12] sparc/mm: " Kevin Brodsky
@ 2025-11-24 13:22 ` Kevin Brodsky
2025-11-24 14:18 ` David Hildenbrand (Red Hat)
2025-11-25 13:39 ` Jürgen Groß
2025-12-03 16:08 ` [PATCH v5 00/12] Nesting support for lazy MMU mode Venkat
12 siblings, 2 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-24 13:22 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
We currently set a TIF flag when scheduling out a task that is in
lazy MMU mode, in order to restore it when the task is scheduled
again.
The generic lazy_mmu layer now tracks whether a task is in lazy MMU
mode in task_struct::lazy_mmu_state. We can therefore check that
state when switching to the new task, instead of using a separate
TIF flag.
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
arch/x86/include/asm/thread_info.h | 4 +---
arch/x86/xen/enlighten_pv.c | 3 +--
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e71e0e8362ed..0067684afb5b 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -100,8 +100,7 @@ struct thread_info {
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_SINGLESTEP 25 /* reenable singlestep on user return*/
#define TIF_BLOCKSTEP 26 /* set when we want DEBUGCTLMSR_BTF */
-#define TIF_LAZY_MMU_UPDATES 27 /* task is updating the mmu lazily */
-#define TIF_ADDR32 28 /* 32-bit address space on 64 bits */
+#define TIF_ADDR32 27 /* 32-bit address space on 64 bits */
#define _TIF_SSBD BIT(TIF_SSBD)
#define _TIF_SPEC_IB BIT(TIF_SPEC_IB)
@@ -114,7 +113,6 @@ struct thread_info {
#define _TIF_FORCED_TF BIT(TIF_FORCED_TF)
#define _TIF_BLOCKSTEP BIT(TIF_BLOCKSTEP)
#define _TIF_SINGLESTEP BIT(TIF_SINGLESTEP)
-#define _TIF_LAZY_MMU_UPDATES BIT(TIF_LAZY_MMU_UPDATES)
#define _TIF_ADDR32 BIT(TIF_ADDR32)
/* flags to check in __switch_to() */
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 4806cc28d7ca..98dbb6a61087 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -426,7 +426,6 @@ static void xen_start_context_switch(struct task_struct *prev)
if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) {
arch_leave_lazy_mmu_mode();
- set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU_UPDATES);
}
enter_lazy(XEN_LAZY_CPU);
}
@@ -437,7 +436,7 @@ static void xen_end_context_switch(struct task_struct *next)
xen_mc_flush();
leave_lazy(XEN_LAZY_CPU);
- if (test_and_clear_ti_thread_flag(task_thread_info(next), TIF_LAZY_MMU_UPDATES))
+ if (__task_lazy_mmu_mode_active(next))
arch_enter_lazy_mmu_mode();
}
--
2.51.2
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-11-24 13:22 ` [PATCH v5 08/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
@ 2025-11-24 14:09 ` David Hildenbrand (Red Hat)
2025-11-27 12:33 ` Alexander Gordeev
` (2 subsequent siblings)
3 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 14:09 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 11/24/25 14:22, Kevin Brodsky wrote:
> Despite recent efforts to prevent lazy_mmu sections from nesting, it
> remains difficult to ensure that it never occurs - and in fact it
> does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> made nesting tolerable on arm64, but without truly supporting it:
> the inner call to leave() disables the batching optimisation before
> the outer section ends.
>
> This patch actually enables lazy_mmu sections to nest by tracking
> the nesting level in task_struct, in a similar fashion to e.g.
> pagefault_{enable,disable}(). This is fully handled by the generic
> lazy_mmu helpers that were recently introduced.
>
> lazy_mmu sections were not initially intended to nest, so we need to
> clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
> This patch takes the following approach:
>
> * The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
> calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
>
> * Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
> to the arch via arch_{enter,leave} - lazy MMU remains enabled so
> the assumption is that these callbacks are not relevant. However,
> existing code may rely on a call to disable() to flush any batched
> state, regardless of nesting. arch_flush_lazy_mmu_mode() is
> therefore called in that situation.
>
> A separate interface was recently introduced to temporarily pause
> the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
> exits the mode *regardless of the nesting level*, and resume()
> restores the mode at the same nesting level.
>
> pause()/resume() are themselves allowed to nest, so we actually
> store two nesting levels in task_struct: enable_count and
> pause_count. A new helper in_lazy_mmu_mode() is introduced to
> determine whether we are currently in lazy MMU mode; this will be
> used in subsequent patches to replace the various ways arch's
> currently track whether the mode is enabled.
>
> In summary (enable/pause represent the values *after* the call):
>
> lazy_mmu_mode_enable() -> arch_enter() enable=1 pause=0
> lazy_mmu_mode_enable() -> ø enable=2 pause=0
> lazy_mmu_mode_pause() -> arch_leave() enable=2 pause=1
> lazy_mmu_mode_resume() -> arch_enter() enable=2 pause=0
> lazy_mmu_mode_disable() -> arch_flush() enable=1 pause=0
> lazy_mmu_mode_disable() -> arch_leave() enable=0 pause=0
>
> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
> headers included by <linux/pgtable.h> to use it.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Nothing jumped at me, so
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Hoping we can get some more eyes to have a look.
--
Cheers
David
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
2025-11-24 13:22 ` [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
@ 2025-11-24 14:10 ` David Hildenbrand (Red Hat)
2025-12-04 6:52 ` Anshuman Khandual
1 sibling, 0 replies; 40+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 14:10 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 11/24/25 14:22, Kevin Brodsky wrote:
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. As a result we no longer need a TIF flag for that purpose -
> let's use the new in_lazy_mmu_mode() helper instead.
>
> The explicit check for in_interrupt() is no longer necessary either
> as in_lazy_mmu_mode() always returns false in interrupt context.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
Nothing jumped at me
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 07/12] mm: bail out of lazy_mmu_mode_* in interrupt context
2025-11-24 13:22 ` [PATCH v5 07/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
@ 2025-11-24 14:11 ` David Hildenbrand (Red Hat)
2025-12-04 4:34 ` Anshuman Khandual
1 sibling, 0 replies; 40+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 14:11 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 11/24/25 14:22, Kevin Brodsky wrote:
> The lazy MMU mode cannot be used in interrupt context. This is
> documented in <linux/pgtable.h>, but isn't consistently handled
> across architectures.
>
> arm64 ensures that calls to lazy_mmu_mode_* have no effect in
> interrupt context, because such calls do occur in certain
> configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier
> batching in interrupt contexts"). Other architectures do not check
> this situation, most likely because it hasn't occurred so far.
>
> Let's handle this in the new generic lazy_mmu layer, in the same
> fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt().
> Also remove the arm64 handling that is now redundant.
>
> Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is
> disabled while in interrupt (see queue_pte_barriers() and
> xen_get_lazy_mode() respectively). This will be handled in the
> generic layer in a subsequent patch.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
(resending as I pushed the wrong button there ...)
Moving this patch earlier LGTM, hoping we don't get any unexpected
surprises ...
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 12/12] x86/xen: use lazy_mmu_state when context-switching
2025-11-24 13:22 ` [PATCH v5 12/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
@ 2025-11-24 14:18 ` David Hildenbrand (Red Hat)
2025-11-25 13:39 ` Jürgen Groß
1 sibling, 0 replies; 40+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-24 14:18 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 11/24/25 14:22, Kevin Brodsky wrote:
> We currently set a TIF flag when scheduling out a task that is in
> lazy MMU mode, in order to restore it when the task is scheduled
> again.
>
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode in task_struct::lazy_mmu_state. We can therefore check that
> state when switching to the new task, instead of using a separate
> TIF flag.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
Nothing jumped at me, hoping for another pair of eyes from the XEN folks
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
--
Cheers
David
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 12/12] x86/xen: use lazy_mmu_state when context-switching
2025-11-24 13:22 ` [PATCH v5 12/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
2025-11-24 14:18 ` David Hildenbrand (Red Hat)
@ 2025-11-25 13:39 ` Jürgen Groß
1 sibling, 0 replies; 40+ messages in thread
From: Jürgen Groß @ 2025-11-25 13:39 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 24.11.25 14:22, Kevin Brodsky wrote:
> We currently set a TIF flag when scheduling out a task that is in
> lazy MMU mode, in order to restore it when the task is scheduled
> again.
>
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode in task_struct::lazy_mmu_state. We can therefore check that
> state when switching to the new task, instead of using a separate
> TIF flag.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Juergen
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-11-24 13:22 ` [PATCH v5 08/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
2025-11-24 14:09 ` David Hildenbrand (Red Hat)
@ 2025-11-27 12:33 ` Alexander Gordeev
2025-11-27 12:45 ` Kevin Brodsky
2025-11-28 13:55 ` Alexander Gordeev
2025-12-04 6:23 ` Anshuman Khandual
3 siblings, 1 reply; 40+ messages in thread
From: Alexander Gordeev @ 2025-11-27 12:33 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On Mon, Nov 24, 2025 at 01:22:24PM +0000, Kevin Brodsky wrote:
Hi Kevin,
...
> +/**
> + * lazy_mmu_mode_pause() - Pause the lazy MMU mode.
> + *
> + * Pauses the lazy MMU mode; if it is currently active, disables it and calls
> + * arch_leave_lazy_mmu_mode().
> + *
> + * Must be paired with a call to lazy_mmu_mode_resume(). Calls to the
> + * lazy_mmu_mode_* API have no effect until the matching resume() call.
Sorry if it was discussed already - if yes, I somehow missed it. If I read
the whole thing correctly enter()/pause() interleaving is not forbidden?
lazy_mmu_mode_enable()
lazy_mmu_mode_pause()
lazy_mmu_mode_enable()
...
lazy_mmu_mode_disable()
lazy_mmu_mode_resume()
lazy_mmu_mode_disable()
> + *
> + * Has no effect if called:
> + * - While paused (inside another pause()/resume() pair)
> + * - In interrupt context
> + */
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> if (in_interrupt())
> return;
>
> - arch_leave_lazy_mmu_mode();
> + VM_WARN_ON_ONCE(state->pause_count == U8_MAX);
> +
> + if (state->pause_count++ == 0 && state->enable_count > 0)
> + arch_leave_lazy_mmu_mode();
> }
>
> +/**
> + * lazy_mmu_mode_pause() - Resume the lazy MMU mode.
resume() ?
> + *
> + * Resumes the lazy MMU mode; if it was active at the point where the matching
> + * call to lazy_mmu_mode_pause() was made, re-enables it and calls
> + * arch_enter_lazy_mmu_mode().
> + *
> + * Must match a call to lazy_mmu_mode_pause().
> + *
> + * Has no effect if called:
> + * - While paused (inside another pause()/resume() pair)
> + * - In interrupt context
> + */
> static inline void lazy_mmu_mode_resume(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> if (in_interrupt())
> return;
>
> - arch_enter_lazy_mmu_mode();
> + VM_WARN_ON_ONCE(state->pause_count == 0);
> +
> + if (--state->pause_count == 0 && state->enable_count > 0)
> + arch_enter_lazy_mmu_mode();
> }
...
Thanks!
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-11-27 12:33 ` Alexander Gordeev
@ 2025-11-27 12:45 ` Kevin Brodsky
0 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-11-27 12:45 UTC (permalink / raw)
To: Alexander Gordeev
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 27/11/2025 13:33, Alexander Gordeev wrote:
> On Mon, Nov 24, 2025 at 01:22:24PM +0000, Kevin Brodsky wrote:
>
> Hi Kevin,
>
> ...
>> +/**
>> + * lazy_mmu_mode_pause() - Pause the lazy MMU mode.
>> + *
>> + * Pauses the lazy MMU mode; if it is currently active, disables it and calls
>> + * arch_leave_lazy_mmu_mode().
>> + *
>> + * Must be paired with a call to lazy_mmu_mode_resume(). Calls to the
>> + * lazy_mmu_mode_* API have no effect until the matching resume() call.
> Sorry if it was discussed already - if yes, I somehow missed it. If I read
> the whole thing correctly enter()/pause() interleaving is not forbidden?
Correct, any call inside pause()/resume() is now allowed (but
effectively ignored). See discussion with Ryan in v4 [1].
[1]
https://lore.kernel.org/all/824bf705-e9d6-4eeb-9532-9059fa56427f@arm.com/
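Concretely, with the sequence quoted below, the state would evolve as
follows (values after each call, derived from the semantics in this patch):

lazy_mmu_mode_enable()   -> arch_enter()   enable=1 pause=0
lazy_mmu_mode_pause()    -> arch_leave()   enable=1 pause=1
lazy_mmu_mode_enable()   -> ø (paused)     enable=1 pause=1
lazy_mmu_mode_disable()  -> ø (paused)     enable=1 pause=1
lazy_mmu_mode_resume()   -> arch_enter()   enable=1 pause=0
lazy_mmu_mode_disable()  -> arch_leave()   enable=0 pause=0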
> lazy_mmu_mode_enable()
> lazy_mmu_mode_pause()
> lazy_mmu_mode_enable()
> ...
> lazy_mmu_mode_disable()
> lazy_mmu_mode_resume()
> lazy_mmu_mode_disable()
>
>> + *
>> + * Has no effect if called:
>> + * - While paused (inside another pause()/resume() pair)
>> + * - In interrupt context
>> + */
>> static inline void lazy_mmu_mode_pause(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> if (in_interrupt())
>> return;
>>
>> - arch_leave_lazy_mmu_mode();
>> + VM_WARN_ON_ONCE(state->pause_count == U8_MAX);
>> +
>> + if (state->pause_count++ == 0 && state->enable_count > 0)
>> + arch_leave_lazy_mmu_mode();
>> }
>>
>> +/**
>> + * lazy_mmu_mode_pause() - Resume the lazy MMU mode.
> resume() ?
Good catch! One copy-paste too many...
- Kevin
>> + *
>> + * Resumes the lazy MMU mode; if it was active at the point where the matching
>> + * call to lazy_mmu_mode_pause() was made, re-enables it and calls
>> + * arch_enter_lazy_mmu_mode().
>> + *
>> + * Must match a call to lazy_mmu_mode_pause().
>> + *
>> + * Has no effect if called:
>> + * - While paused (inside another pause()/resume() pair)
>> + * - In interrupt context
>> + */
>> static inline void lazy_mmu_mode_resume(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> if (in_interrupt())
>> return;
>>
>> - arch_enter_lazy_mmu_mode();
>> + VM_WARN_ON_ONCE(state->pause_count == 0);
>> +
>> + if (--state->pause_count == 0 && state->enable_count > 0)
>> + arch_enter_lazy_mmu_mode();
>> }
> ...
> Thanks!
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers
2025-11-24 13:22 ` [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers Kevin Brodsky
@ 2025-11-28 13:50 ` Alexander Gordeev
2025-12-03 8:20 ` Kevin Brodsky
2025-12-04 4:17 ` Anshuman Khandual
1 sibling, 1 reply; 40+ messages in thread
From: Alexander Gordeev @ 2025-11-28 13:50 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On Mon, Nov 24, 2025 at 01:22:22PM +0000, Kevin Brodsky wrote:
> The implementation of the lazy MMU mode is currently entirely
> arch-specific; core code directly calls arch helpers:
> arch_{enter,leave}_lazy_mmu_mode().
>
> We are about to introduce support for nested lazy MMU sections.
> As things stand we'd have to duplicate that logic in every arch
> implementing lazy_mmu - adding to a fair amount of logic
> already duplicated across lazy_mmu implementations.
>
> This patch therefore introduces a new generic layer that calls the
> existing arch_* helpers. Two pair of calls are introduced:
>
> * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
> This is the standard case where the mode is enabled for a given
> block of code by surrounding it with enable() and disable()
> calls.
>
> * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
> This is for situations where the mode is temporarily disabled
> by first calling pause() and then resume() (e.g. to prevent any
> batching from occurring in a critical section).
>
> The documentation in <linux/pgtable.h> will be updated in a
> subsequent patch.
>
> No functional change should be introduced at this stage.
> The implementation of enable()/resume() and disable()/pause() is
> currently identical, but nesting support will change that.
>
> Most of the call sites have been updated using the following
> Coccinelle script:
>
> @@
> @@
> {
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> ...
> }
>
> @@
> @@
> {
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
> ...
> }
>
> A couple of notes regarding x86:
>
> * Xen is currently the only case where explicit handling is required
> for lazy MMU when context-switching. This is purely an
> implementation detail and using the generic lazy_mmu_mode_*
> functions would cause trouble when nesting support is introduced,
> because the generic functions must be called from the current task.
> For that reason we still use arch_leave() and arch_enter() there.
>
> * x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
> places, but only defines it if PARAVIRT_XXL is selected, and we
> are removing the fallback in <linux/pgtable.h>. Add a new fallback
> definition to <asm/pgtable.h> to keep things building.
Would it make sense to explicitly describe the policy wrt sleeping while
in lazy MMU mode? If I understand the conclusion of the conversation
correctly:
* An arch implementation may disable preemption, but then it is the arch's
responsibility not to call any arch-specific code that might sleep;
* As a result, while in lazy MMU mode the generic code should never
call code that might sleep;
1. https://lore.kernel.org/linux-mm/b52726c7-ea9c-4743-a68d-3eafce4e5c61@arm.com/
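To illustrate the second point, a hypothetical sketch (not from the series),
assuming an arch such as sparc or powerpc that calls preempt_disable() in
arch_enter_lazy_mmu_mode():

  lazy_mmu_mode_enable();
  set_pte_at(mm, addr, ptep, pte);        /* batched by the arch */
  /*
   * Not allowed: kmalloc(GFP_KERNEL) may sleep, but the arch may have
   * disabled preemption when entering lazy MMU mode.
   */
  buf = kmalloc(size, GFP_KERNEL);
  lazy_mmu_mode_disable();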
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
...
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-11-24 13:22 ` [PATCH v5 08/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
2025-11-24 14:09 ` David Hildenbrand (Red Hat)
2025-11-27 12:33 ` Alexander Gordeev
@ 2025-11-28 13:55 ` Alexander Gordeev
2025-12-03 8:20 ` Kevin Brodsky
2025-12-04 6:23 ` Anshuman Khandual
3 siblings, 1 reply; 40+ messages in thread
From: Alexander Gordeev @ 2025-11-28 13:55 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On Mon, Nov 24, 2025 at 01:22:24PM +0000, Kevin Brodsky wrote:
...
> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
> + * A nested call to enable() has no functional effect; however disable() causes
> + * any batched architectural state to be flushed regardless of nesting. After a
> + * call to disable(), the caller can therefore rely on all previous page table
> + * modifications to have taken effect, but the lazy MMU mode may still be
> + * enabled.
> + *
> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
> + * This can be done using:
> + *
> + * lazy_mmu_mode_pause();
> + * <code>
> + * lazy_mmu_mode_resume();
> + *
> + * pause() ensures that the mode is exited regardless of the nesting level;
> + * resume() re-enters the mode at the same nesting level. Any call to the
> + * lazy_mmu_mode_* API between those two calls has no effect. In particular,
> + * this means that pause()/resume() pairs may nest.
> + *
> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> + * currently enabled.
The in_lazy_mmu_mode() name looks ambiguous to me. When the lazy MMU mode
is paused, are we still in lazy MMU mode? The __task_lazy_mmu_mode_active()
implementation suggests we are not, while one could still assume we are,
just paused.
Should in_lazy_mmu_mode() be named e.g. in_active_lazy_mmu_mode(), so that
such confusion would not occur in the first place?
> */
...
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +/**
> + * __task_lazy_mmu_mode_active() - Test the lazy MMU mode state for a task.
> + * @tsk: The task to check.
> + *
> + * Test whether @tsk has its lazy MMU mode state set to active (i.e. enabled
> + * and not paused).
> + *
> + * This function only considers the state saved in task_struct; to test whether
> + * current actually is in lazy MMU mode, in_lazy_mmu_mode() should be used
> + * instead.
> + *
> + * This function is intended for architectures that implement the lazy MMU
> + * mode; it must not be called from generic code.
> + */
> +static inline bool __task_lazy_mmu_mode_active(struct task_struct *tsk)
> +{
> + struct lazy_mmu_state *state = &tsk->lazy_mmu_state;
> +
> + return state->enable_count > 0 && state->pause_count == 0;
> +}
> +
> +/**
> + * in_lazy_mmu_mode() - Test whether we are currently in lazy MMU mode.
> + *
> + * Test whether the current context is in lazy MMU mode. This is true if both:
> + * 1. We are not in interrupt context
> + * 2. Lazy MMU mode is active for the current task
> + *
> + * This function is intended for architectures that implement the lazy MMU
> + * mode; it must not be called from generic code.
> + */
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + if (in_interrupt())
> + return false;
> +
> + return __task_lazy_mmu_mode_active(current);
> +}
> +#endif
...
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
2025-11-24 13:22 ` [PATCH v5 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Kevin Brodsky
@ 2025-12-01 6:21 ` Anshuman Khandual
2025-12-03 8:19 ` Kevin Brodsky
0 siblings, 1 reply; 40+ messages in thread
From: Anshuman Khandual @ 2025-12-01 6:21 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 24/11/25 6:52 PM, Kevin Brodsky wrote:
> Architectures currently opt in for implementing lazy_mmu helpers by
> defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE.
>
> In preparation for introducing a generic lazy_mmu layer that will
> require storage in task_struct, let's switch to a cleaner approach:
> instead of defining a macro, select a CONFIG option.
>
> This patch introduces CONFIG_ARCH_HAS_LAZY_MMU_MODE and has each
> arch select it when it implements lazy_mmu helpers.
> __HAVE_ARCH_ENTER_LAZY_MMU_MODE is removed and <linux/pgtable.h>
> relies on the new CONFIG instead.
>
> On x86, lazy_mmu helpers are only implemented if PARAVIRT_XXL is
> selected. This creates some complications in arch/x86/boot/, because
> a few files manually undefine PARAVIRT* options. As a result
> <asm/paravirt.h> does not define the lazy_mmu helpers, but this
> breaks the build as <linux/pgtable.h> only defines them if
> !CONFIG_ARCH_HAS_LAZY_MMU_MODE. There does not seem to be a clean
> way out of this - let's just undefine that new CONFIG too.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/pgtable.h | 1 -
> arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 --
> arch/powerpc/platforms/Kconfig.cputype | 1 +
> arch/sparc/Kconfig | 1 +
> arch/sparc/include/asm/tlbflush_64.h | 2 --
> arch/x86/Kconfig | 1 +
> arch/x86/boot/compressed/misc.h | 1 +
> arch/x86/boot/startup/sme.c | 1 +
> arch/x86/include/asm/paravirt.h | 1 -
> include/linux/pgtable.h | 2 +-
> mm/Kconfig | 3 +++
> 12 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 6663ffd23f25..74be32f5f446 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -34,6 +34,7 @@ config ARM64
> select ARCH_HAS_KCOV
> select ARCH_HAS_KERNEL_FPU_SUPPORT if KERNEL_MODE_NEON
> select ARCH_HAS_KEEPINITRD
> + select ARCH_HAS_LAZY_MMU_MODE
> select ARCH_HAS_MEMBARRIER_SYNC_CORE
> select ARCH_HAS_MEM_ENCRYPT
> select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0944e296dd4a..54f8d6bb6f22 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -80,7 +80,6 @@ static inline void queue_pte_barriers(void)
> }
> }
>
> -#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> /*
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index 2d45f57df169..565c1b7c3eae 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -24,8 +24,6 @@ DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch);
>
> extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);
>
> -#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> -
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> struct ppc64_tlb_batch *batch;
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 4c321a8ea896..f399917c17bd 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -93,6 +93,7 @@ config PPC_BOOK3S_64
> select IRQ_WORK
> select PPC_64S_HASH_MMU if !PPC_RADIX_MMU
> select KASAN_VMALLOC if KASAN
> + select ARCH_HAS_LAZY_MMU_MODE
>
> config PPC_BOOK3E_64
> bool "Embedded processors"
> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
> index a630d373e645..2bad14744ca4 100644
> --- a/arch/sparc/Kconfig
> +++ b/arch/sparc/Kconfig
> @@ -112,6 +112,7 @@ config SPARC64
> select NEED_PER_CPU_PAGE_FIRST_CHUNK
> select ARCH_SUPPORTS_SCHED_SMT if SMP
> select ARCH_SUPPORTS_SCHED_MC if SMP
> + select ARCH_HAS_LAZY_MMU_MODE
>
> config ARCH_PROC_KCORE_TEXT
> def_bool y
> diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h
> index 925bb5d7a4e1..4e1036728e2f 100644
> --- a/arch/sparc/include/asm/tlbflush_64.h
> +++ b/arch/sparc/include/asm/tlbflush_64.h
> @@ -39,8 +39,6 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
>
> void flush_tlb_kernel_range(unsigned long start, unsigned long end);
>
> -#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> -
> void flush_tlb_pending(void);
> void arch_enter_lazy_mmu_mode(void);
> void arch_flush_lazy_mmu_mode(void);
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index a3700766a8c0..db769c4addf9 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -805,6 +805,7 @@ config PARAVIRT
> config PARAVIRT_XXL
> bool
> depends on X86_64
> + select ARCH_HAS_LAZY_MMU_MODE
>
> config PARAVIRT_DEBUG
> bool "paravirt-ops debugging"
> diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h
> index db1048621ea2..cdd7f692d9ee 100644
> --- a/arch/x86/boot/compressed/misc.h
> +++ b/arch/x86/boot/compressed/misc.h
> @@ -11,6 +11,7 @@
> #undef CONFIG_PARAVIRT
> #undef CONFIG_PARAVIRT_XXL
> #undef CONFIG_PARAVIRT_SPINLOCKS
> +#undef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> #undef CONFIG_KASAN
> #undef CONFIG_KASAN_GENERIC
>
> diff --git a/arch/x86/boot/startup/sme.c b/arch/x86/boot/startup/sme.c
> index e7ea65f3f1d6..b76a7c95dfe1 100644
> --- a/arch/x86/boot/startup/sme.c
> +++ b/arch/x86/boot/startup/sme.c
> @@ -24,6 +24,7 @@
> #undef CONFIG_PARAVIRT
> #undef CONFIG_PARAVIRT_XXL
> #undef CONFIG_PARAVIRT_SPINLOCKS
> +#undef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>
> /*
> * This code runs before CPU feature bits are set. By default, the
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index b5e59a7ba0d0..13f9cd31c8f8 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -526,7 +526,6 @@ static inline void arch_end_context_switch(struct task_struct *next)
> PVOP_VCALL1(cpu.end_context_switch, next);
> }
>
> -#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> PVOP_VCALL0(mmu.lazy_mode.enter);
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b13b6f42be3c..de7d2c7e63eb 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -231,7 +231,7 @@ static inline int pmd_dirty(pmd_t pmd)
> * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> * and the mode cannot be used in interrupt context.
> */
> -#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> +#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline void arch_enter_lazy_mmu_mode(void) {}
> static inline void arch_leave_lazy_mmu_mode(void) {}
> static inline void arch_flush_lazy_mmu_mode(void) {}
> diff --git a/mm/Kconfig b/mm/Kconfig
> index bd0ea5454af8..a7486fae0cd3 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1464,6 +1464,9 @@ config PT_RECLAIM
> config FIND_NORMAL_PAGE
> def_bool n
>
> +config ARCH_HAS_LAZY_MMU_MODE
> + bool
> +
Might be worth adding a help description for the new config option.
> source "mm/damon/Kconfig"
>
> endmenu
* Re: [PATCH v5 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
2025-12-01 6:21 ` Anshuman Khandual
@ 2025-12-03 8:19 ` Kevin Brodsky
0 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-12-03 8:19 UTC (permalink / raw)
To: Anshuman Khandual, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 01/12/2025 07:21, Anshuman Khandual wrote:
>> +config ARCH_HAS_LAZY_MMU_MODE
>> + bool
>> +
> Might be worth adding a help description for the new config option.
Sure, it would be an occasion to clarify which functions an arch needs to
define.
- Kevin
* Re: [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers
2025-11-28 13:50 ` Alexander Gordeev
@ 2025-12-03 8:20 ` Kevin Brodsky
0 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-12-03 8:20 UTC (permalink / raw)
To: Alexander Gordeev
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 28/11/2025 14:50, Alexander Gordeev wrote:
> Would it make sense to explicitly describe the policy wrt sleeping while
> in lazy MMU mode? If I understand the conclusion of conversation right:
>
> * An arch implementation may disable preemption, but then it is arch
> responsibility not to call any arch-specific code that might sleep;
> * As result, while in lazy MMU mode the generic code should never
> call a code that might sleep;
I think that's a good summary, and I agree that the second point is not
obvious from the comment in <linux/pgtable.h>. This series is not making
any change in that respect, but I'll add a clarification in this patch
(or a separate patch).
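For illustration, a minimal sketch of what that policy implies for a caller
(entirely hypothetical - only the lazy_mmu_mode_* helpers come from this
series, and whether pausing actually makes sleeping legal again depends on
the arch implementation):

	#include <linux/gfp.h>
	#include <linux/pgtable.h>

	/* Hypothetical caller, for illustration only. */
	static void example_batched_update(void)
	{
		struct page *page;

		lazy_mmu_mode_enable();

		/* ... batched PTE updates; nothing here may sleep ... */

		/*
		 * A GFP_KERNEL allocation may sleep, so exit the mode first:
		 * pause() flushes any batched state and fully leaves the
		 * mode, regardless of nesting.
		 */
		lazy_mmu_mode_pause();
		page = alloc_page(GFP_KERNEL);
		lazy_mmu_mode_resume();

		if (page) {
			/* ... more batched PTE updates using 'page' ... */
		}

		lazy_mmu_mode_disable();
	}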
- Kevin
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-11-28 13:55 ` Alexander Gordeev
@ 2025-12-03 8:20 ` Kevin Brodsky
2025-12-04 5:25 ` Anshuman Khandual
0 siblings, 1 reply; 40+ messages in thread
From: Kevin Brodsky @ 2025-12-03 8:20 UTC (permalink / raw)
To: Alexander Gordeev
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 28/11/2025 14:55, Alexander Gordeev wrote:
>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>> + * currently enabled.
> The in_lazy_mmu_mode() name looks ambiguous to me. When the lazy MMU mode
> is paused are we still in lazy MMU mode? The __task_lazy_mmu_mode_active()
> implementation suggests we are not, while one could still assume we are,
> just paused.
>
> Should in_lazy_mmu_mode() be named e.g. as in_active_lazy_mmu_mode() such
> a confusion would not occur in the first place.
I see your point, how about is_lazy_mmu_mode_active()?
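To make the ambiguity concrete, a hypothetical sequence (counter values as in
the summary in the commit message):

	lazy_mmu_mode_enable();		/* enable_count=1, pause_count=0 */
	/* in_lazy_mmu_mode() returns true: enabled and active */

	lazy_mmu_mode_pause();		/* enable_count=1, pause_count=1 */
	/* in_lazy_mmu_mode() returns false: still inside an enabled
	 * section, but paused - hence the naming question */

	lazy_mmu_mode_resume();		/* enable_count=1, pause_count=0 */
	lazy_mmu_mode_disable();	/* enable_count=0, pause_count=0 */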
- Kevin
* Re: [PATCH v5 00/12] Nesting support for lazy MMU mode
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
` (11 preceding siblings ...)
2025-11-24 13:22 ` [PATCH v5 12/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
@ 2025-12-03 16:08 ` Venkat
2025-12-05 13:00 ` Kevin Brodsky
12 siblings, 1 reply; 40+ messages in thread
From: Venkat @ 2025-12-03 16:08 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-mm, LKML, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
> On 24 Nov 2025, at 6:52 PM, Kevin Brodsky <kevin.brodsky@arm.com> wrote:
>
> When the lazy MMU mode was introduced eons ago, it wasn't made clear
> whether such a sequence was legal:
>
> arch_enter_lazy_mmu_mode()
> ...
> arch_enter_lazy_mmu_mode()
> ...
> arch_leave_lazy_mmu_mode()
> ...
> arch_leave_lazy_mmu_mode()
>
> It seems fair to say that nested calls to
> arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
> architectures never explicitly supported it.
>
> Nesting does in fact occur in certain configurations, and avoiding it
> has proved difficult. This series therefore enables lazy_mmu sections to
> nest, on all architectures.
>
> Nesting is handled using a counter in task_struct (patch 8), like other
> stateless APIs such as pagefault_{disable,enable}(). This is fully
> handled in a new generic layer in <linux/pgtable.h>; the arch_* API
> remains unchanged. A new pair of calls, lazy_mmu_mode_{pause,resume}(),
> is also introduced to allow functions that are called with the lazy MMU
> mode enabled to temporarily pause it, regardless of nesting.
>
> An arch now opts in to using the lazy MMU mode by selecting
> CONFIG_ARCH_LAZY_MMU; this is more appropriate now that we have a
> generic API, especially with state conditionally added to task_struct.
>
> ---
>
> Background: Ryan Roberts' series from March [1] attempted to prevent
> nesting from ever occurring, and mostly succeeded. Unfortunately, a
> corner case (DEBUG_PAGEALLOC) may still cause nesting to occur on arm64.
> Ryan proposed [2] to address that corner case at the generic level but
> this approach received pushback; [3] then attempted to solve the issue
> on arm64 only, but it was deemed too fragile.
>
> It feels generally difficult to guarantee that lazy_mmu sections don't
> nest, because callers of various standard mm functions do not know if
> the function uses lazy_mmu itself.
>
> The overall approach in v3/v4 is very close to what David Hildenbrand
> proposed on v2 [4].
>
> Unlike in v1/v2, no special provision is made for architectures to
> save/restore extra state when entering/leaving the mode. Based on the
> discussions so far, this does not seem to be required - an arch can
> store any relevant state in thread_struct during arch_enter() and
> restore it in arch_leave(). Nesting is not a concern as these functions
> are only called at the top level, not in nested sections.
>
> The introduction of a generic layer, and tracking of the lazy MMU state
> in task_struct, also allows to streamline the arch callbacks - this
> series removes 67 lines from arch/.
>
> Patch overview:
>
> * Patch 1: cleanup - avoids having to deal with the powerpc
> context-switching code
>
> * Patch 2-4: prepare arch_flush_lazy_mmu_mode() to be called from the
> generic layer (patch 8)
>
> * Patch 5-6: new API + CONFIG_ARCH_LAZY_MMU
>
> * Patch 7: ensure correctness in interrupt context
>
> * Patch 8: nesting support
>
> * Patch 9-12: replace arch-specific tracking of lazy MMU mode with
> generic API
>
> This series has been tested by running the mm kselftests on arm64 with
> DEBUG_VM, DEBUG_PAGEALLOC, KFENCE and KASAN. It was also build-tested on
> other architectures (with and without XEN_PV on x86).
>
> - Kevin
>
> [1] https://lore.kernel.org/all/20250303141542.3371656-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/all/20250530140446.2387131-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/all/20250606135654.178300-1-ryan.roberts@arm.com/
> [4] https://lore.kernel.org/all/ef343405-c394-4763-a79f-21381f217b6c@redhat.com/
> ---
> Changelog
>
> v4..v5:
>
> - Rebased on mm-unstable
> - Patch 3: added missing radix_enabled() check in arch_flush()
> [Ritesh Harjani]
> - Patch 6: declare arch_flush_lazy_mmu_mode() as static inline on x86
> [Ryan Roberts]
> - Patch 7 (formerly 12): moved before patch 8 to ensure correctness in
> interrupt context [Ryan]. The diffs in in_lazy_mmu_mode() and
> queue_pte_barriers() are moved to patch 8 and 9 resp.
> - Patch 8:
> * Removed all restrictions regarding lazy_mmu_mode_{pause,resume}().
> They may now be called even when lazy MMU isn't enabled, and
> any call to lazy_mmu_mode_* may be made while paused (such calls
> will be ignored). [David, Ryan]
> * lazy_mmu_state.{nesting_level,active} are replaced with
> {enable_count,pause_count} to track arbitrary nesting of both
> enable/disable and pause/resume [Ryan]
> * Added __task_lazy_mmu_mode_active() for use in patch 12 [David]
> * Added documentation for all the functions [Ryan]
> - Patch 9: keep existing test + set TIF_LAZY_MMU_PENDING instead of
> atomic RMW [David, Ryan]
> - Patch 12: use __task_lazy_mmu_mode_active() instead of accessing
> lazy_mmu_state directly [David]
> - Collected R-b/A-b tags
>
> v4: https://lore.kernel.org/all/20251029100909.3381140-1-kevin.brodsky@arm.com/
>
> v3..v4:
>
> - Patch 2: restored ordering of preempt_{disable,enable}() [Dave Hansen]
> - Patch 5 onwards: s/ARCH_LAZY_MMU/ARCH_HAS_LAZY_MMU_MODE/ [Mike Rapoport]
> - Patch 7: renamed lazy_mmu_state members, removed VM_BUG_ON(),
> reordered writes to lazy_mmu_state members [David Hildenbrand]
> - Dropped patch 13 as it doesn't seem justified [David H]
> - Various improvements to commit messages [David H]
>
> v3: https://lore.kernel.org/all/20251015082727.2395128-1-kevin.brodsky@arm.com/
>
> v2..v3:
>
> - Full rewrite; dropped all Acked-by/Reviewed-by.
> - Rebased on v6.18-rc1.
>
> v2: https://lore.kernel.org/all/20250908073931.4159362-1-kevin.brodsky@arm.com/
>
> v1..v2:
> - Rebased on mm-unstable.
> - Patch 2: handled new calls to enter()/leave(), clarified how the "flush"
> pattern (leave() followed by enter()) is handled.
> - Patch 5,6: removed unnecessary local variable [Alexander Gordeev's
> suggestion].
> - Added Mike Rapoport's Acked-by.
>
> v1: https://lore.kernel.org/all/20250904125736.3918646-1-kevin.brodsky@arm.com/
> ---
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Andreas Larsson <andreas@gaisler.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: David Woodhouse <dwmw2@infradead.org>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yeoreum Yun <yeoreum.yun@arm.com>
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: sparclinux@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Cc: x86@kernel.org
> ---
> Alexander Gordeev (1):
> powerpc/64s: Do not re-activate batched TLB flush
>
> Kevin Brodsky (11):
> x86/xen: simplify flush_lazy_mmu()
> powerpc/mm: implement arch_flush_lazy_mmu_mode()
> sparc/mm: implement arch_flush_lazy_mmu_mode()
> mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE
> mm: introduce generic lazy_mmu helpers
> mm: bail out of lazy_mmu_mode_* in interrupt context
> mm: enable lazy_mmu sections to nest
> arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
> powerpc/mm: replace batch->active with in_lazy_mmu_mode()
> sparc/mm: replace batch->active with in_lazy_mmu_mode()
> x86/xen: use lazy_mmu_state when context-switching
>
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/pgtable.h | 41 +----
> arch/arm64/include/asm/thread_info.h | 3 +-
> arch/arm64/mm/mmu.c | 4 +-
> arch/arm64/mm/pageattr.c | 4 +-
> .../include/asm/book3s/64/tlbflush-hash.h | 20 ++-
> arch/powerpc/include/asm/thread_info.h | 2 -
> arch/powerpc/kernel/process.c | 25 ---
> arch/powerpc/mm/book3s64/hash_tlb.c | 10 +-
> arch/powerpc/mm/book3s64/subpage_prot.c | 4 +-
> arch/powerpc/platforms/Kconfig.cputype | 1 +
> arch/sparc/Kconfig | 1 +
> arch/sparc/include/asm/tlbflush_64.h | 5 +-
> arch/sparc/mm/tlb.c | 14 +-
> arch/x86/Kconfig | 1 +
> arch/x86/boot/compressed/misc.h | 1 +
> arch/x86/boot/startup/sme.c | 1 +
> arch/x86/include/asm/paravirt.h | 1 -
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/include/asm/thread_info.h | 4 +-
> arch/x86/xen/enlighten_pv.c | 3 +-
> arch/x86/xen/mmu_pv.c | 6 +-
> fs/proc/task_mmu.c | 4 +-
> include/linux/mm_types_task.h | 5 +
> include/linux/pgtable.h | 147 +++++++++++++++++-
> include/linux/sched.h | 45 ++++++
> mm/Kconfig | 3 +
> mm/kasan/shadow.c | 8 +-
> mm/madvise.c | 18 +--
> mm/memory.c | 16 +-
> mm/migrate_device.c | 8 +-
> mm/mprotect.c | 4 +-
> mm/mremap.c | 4 +-
> mm/userfaultfd.c | 4 +-
> mm/vmalloc.c | 12 +-
> mm/vmscan.c | 12 +-
> 36 files changed, 282 insertions(+), 161 deletions(-)
Tested this patch series by applying it on top of mm-unstable, on both HASH and RADIX MMU, and all tests passed on both MMUs.
Ran cache_shape, copyloops and mm from the Linux source tree's selftests/powerpc/, and memory-hotplug from selftests/. Also ran the below tests from the avocado-misc-tests repo.
Link to repo: https://github.com/avocado-framework-tests/avocado-misc-tests
avocado-misc-tests/memory/stutter.py
avocado-misc-tests/memory/eatmemory.py
avocado-misc-tests/memory/hugepage_sanity.py
avocado-misc-tests/memory/fork_mem.py
avocado-misc-tests/memory/memory_api.py
avocado-misc-tests/memory/mprotect.py
avocado-misc-tests/memory/vatest.py avocado-misc-tests/memory/vatest.py.data/vatest.yaml
avocado-misc-tests/memory/transparent_hugepages.py
avocado-misc-tests/memory/transparent_hugepages_swapping.py
avocado-misc-tests/memory/transparent_hugepages_defrag.py
avocado-misc-tests/memory/ksm_poison.py
If it's good enough, please add the below tag for the PowerPC changes.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Regards,
Venkat.
>
>
> base-commit: 1f1edd95f9231ba58a1e535b10200cb1eeaf1f67
> --
> 2.51.2
>
* Re: [PATCH v5 02/12] x86/xen: simplify flush_lazy_mmu()
2025-11-24 13:22 ` [PATCH v5 02/12] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
@ 2025-12-04 3:36 ` Anshuman Khandual
0 siblings, 0 replies; 40+ messages in thread
From: Anshuman Khandual @ 2025-12-04 3:36 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 24/11/25 6:52 PM, Kevin Brodsky wrote:
> arch_flush_lazy_mmu_mode() is called when outstanding batched
> pgtable operations must be completed immediately. There should
> however be no need to leave and re-enter lazy MMU completely. The
> only part of that sequence that we really need is xen_mc_flush();
> call it directly.
>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
> Reviewed-by: Juergen Gross <jgross@suse.com>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> arch/x86/xen/mmu_pv.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 2a4a8deaf612..7a35c3393df4 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2139,10 +2139,8 @@ static void xen_flush_lazy_mmu(void)
> {
> preempt_disable();
>
> - if (xen_get_lazy_mode() == XEN_LAZY_MMU) {
> - arch_leave_lazy_mmu_mode();
> - arch_enter_lazy_mmu_mode();
> - }
> + if (xen_get_lazy_mode() == XEN_LAZY_MMU)
> + xen_mc_flush();
>
> preempt_enable();
> }
* Re: [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers
2025-11-24 13:22 ` [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers Kevin Brodsky
2025-11-28 13:50 ` Alexander Gordeev
@ 2025-12-04 4:17 ` Anshuman Khandual
2025-12-05 12:47 ` Kevin Brodsky
1 sibling, 1 reply; 40+ messages in thread
From: Anshuman Khandual @ 2025-12-04 4:17 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 24/11/25 6:52 PM, Kevin Brodsky wrote:
> The implementation of the lazy MMU mode is currently entirely
> arch-specific; core code directly calls arch helpers:
> arch_{enter,leave}_lazy_mmu_mode().
>
> We are about to introduce support for nested lazy MMU sections.
> As things stand we'd have to duplicate that logic in every arch
> implementing lazy_mmu - adding to a fair amount of logic
> already duplicated across lazy_mmu implementations.
>
> This patch therefore introduces a new generic layer that calls the
> existing arch_* helpers. Two pair of calls are introduced:
>
> * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
> This is the standard case where the mode is enabled for a given
> block of code by surrounding it with enable() and disable()
> calls.
>
> * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
> This is for situations where the mode is temporarily disabled
> by first calling pause() and then resume() (e.g. to prevent any
> batching from occurring in a critical section).
>
> The documentation in <linux/pgtable.h> will be updated in a
> subsequent patch.
>
> No functional change should be introduced at this stage.
> The implementation of enable()/resume() and disable()/pause() is
> currently identical, but nesting support will change that.
>
> Most of the call sites have been updated using the following
> Coccinelle script:
>
> @@
> @@
> {
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> ...
> }
>
> @@
> @@
> {
> ...
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
> ...
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
> ...
> }
At this point the arch_enter/leave_lazy_mmu_mode() helpers are still
present on a given platform, but are now called from the new generic
helpers lazy_mmu_mode_enable/disable() - except on x86, where there are
still direct call sites for those old helpers.
arch/arm64/include/asm/pgtable.h:static inline void arch_enter_lazy_mmu_mode(void)
arch/arm64/include/asm/pgtable.h:static inline void arch_leave_lazy_mmu_mode(void)
arch/arm64/mm/mmu.c: lazy_mmu_mode_enable();
arch/arm64/mm/pageattr.c: lazy_mmu_mode_enable();
arch/arm64/mm/mmu.c: lazy_mmu_mode_disable();
arch/arm64/mm/pageattr.c: lazy_mmu_mode_disable();
>
> A couple of notes regarding x86:
>
> * Xen is currently the only case where explicit handling is required
> for lazy MMU when context-switching. This is purely an
> implementation detail and using the generic lazy_mmu_mode_*
> functions would cause trouble when nesting support is introduced,
> because the generic functions must be called from the current task.
> For that reason we still use arch_leave() and arch_enter() there.
>
> * x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
> places, but only defines it if PARAVIRT_XXL is selected, and we
> are removing the fallback in <linux/pgtable.h>. Add a new fallback
> definition to <asm/pgtable.h> to keep things building.
>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/mm/mmu.c | 4 ++--
> arch/arm64/mm/pageattr.c | 4 ++--
> arch/powerpc/mm/book3s64/hash_tlb.c | 8 +++----
> arch/powerpc/mm/book3s64/subpage_prot.c | 4 ++--
> arch/x86/include/asm/pgtable.h | 1 +
> fs/proc/task_mmu.c | 4 ++--
> include/linux/pgtable.h | 29 +++++++++++++++++++++----
> mm/kasan/shadow.c | 8 +++----
> mm/madvise.c | 18 +++++++--------
> mm/memory.c | 16 +++++++-------
> mm/migrate_device.c | 8 +++----
> mm/mprotect.c | 4 ++--
> mm/mremap.c | 4 ++--
> mm/userfaultfd.c | 4 ++--
> mm/vmalloc.c | 12 +++++-----
> mm/vmscan.c | 12 +++++-----
> 16 files changed, 81 insertions(+), 59 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 94e29e3574ff..ce66ae77abaa 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -729,7 +729,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> return -EINVAL;
>
> mutex_lock(&pgtable_split_lock);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> /*
> * The split_kernel_leaf_mapping_locked() may sleep, it is not a
> @@ -751,7 +751,7 @@ int split_kernel_leaf_mapping(unsigned long start, unsigned long end)
> ret = split_kernel_leaf_mapping_locked(end);
> }
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> mutex_unlock(&pgtable_split_lock);
> return ret;
> }
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 5135f2d66958..e4059f13c4ed 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -110,7 +110,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
> if (WARN_ON_ONCE(ret))
> return ret;
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> /*
> * The caller must ensure that the range we are operating on does not
> @@ -119,7 +119,7 @@ static int update_range_prot(unsigned long start, unsigned long size,
> */
> ret = walk_kernel_page_table_range_lockless(start, start + size,
> &pageattr_ops, NULL, &data);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
>
> return ret;
> }
> diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
> index 21fcad97ae80..787f7a0e27f0 100644
> --- a/arch/powerpc/mm/book3s64/hash_tlb.c
> +++ b/arch/powerpc/mm/book3s64/hash_tlb.c
> @@ -205,7 +205,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
> * way to do things but is fine for our needs here.
> */
> local_irq_save(flags);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> for (; start < end; start += PAGE_SIZE) {
> pte_t *ptep = find_init_mm_pte(start, &hugepage_shift);
> unsigned long pte;
> @@ -217,7 +217,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end)
> continue;
> hpte_need_flush(&init_mm, start, ptep, pte, hugepage_shift);
> }
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> local_irq_restore(flags);
> }
>
> @@ -237,7 +237,7 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
> * way to do things but is fine for our needs here.
> */
> local_irq_save(flags);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> start_pte = pte_offset_map(pmd, addr);
> if (!start_pte)
> goto out;
> @@ -249,6 +249,6 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
> }
> pte_unmap(start_pte);
> out:
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> local_irq_restore(flags);
> }
> diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
> index ec98e526167e..07c47673bba2 100644
> --- a/arch/powerpc/mm/book3s64/subpage_prot.c
> +++ b/arch/powerpc/mm/book3s64/subpage_prot.c
> @@ -73,13 +73,13 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
> pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> if (!pte)
> return;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> for (; npages > 0; --npages) {
> pte_update(mm, addr, pte, 0, 0, 0);
> addr += PAGE_SIZE;
> ++pte;
> }
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(pte - 1, ptl);
> }
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index e33df3da6980..2842fa1f7a2c 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -118,6 +118,7 @@ extern pmdval_t early_pmd_flags;
> #define __pte(x) native_make_pte(x)
>
> #define arch_end_context_switch(prev) do {} while(0)
> +static inline void arch_flush_lazy_mmu_mode(void) {}
> #endif /* CONFIG_PARAVIRT_XXL */
>
> static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index d00ac179d973..ee1778adcc20 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -2737,7 +2737,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> return 0;
> }
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
> /* Fast path for performing exclusive WP */
> @@ -2807,7 +2807,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> if (flush_end)
> flush_tlb_range(vma, start, addr);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
>
> cond_resched();
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index de7d2c7e63eb..c121358dba15 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -231,10 +231,31 @@ static inline int pmd_dirty(pmd_t pmd)
> * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> * and the mode cannot be used in interrupt context.
> */
> -#ifndef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> -static inline void arch_enter_lazy_mmu_mode(void) {}
> -static inline void arch_leave_lazy_mmu_mode(void) {}
> -static inline void arch_flush_lazy_mmu_mode(void) {}
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +static inline void lazy_mmu_mode_enable(void)
> +{
> + arch_enter_lazy_mmu_mode();
> +}
> +
> +static inline void lazy_mmu_mode_disable(void)
> +{
> + arch_leave_lazy_mmu_mode();
> +}
> +
> +static inline void lazy_mmu_mode_pause(void)
> +{
> + arch_leave_lazy_mmu_mode();
> +}
> +
> +static inline void lazy_mmu_mode_resume(void)
> +{
> + arch_enter_lazy_mmu_mode();
> +}
> +#else
> +static inline void lazy_mmu_mode_enable(void) {}
> +static inline void lazy_mmu_mode_disable(void) {}
> +static inline void lazy_mmu_mode_pause(void) {}
> +static inline void lazy_mmu_mode_resume(void) {}
> #endif
>
> #ifndef pte_batch_hint
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index 29a751a8a08d..c1433d5cc5db 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> pte_t pte;
> int index;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
This replacement works because lazy_mmu_mode_pause() still calls
arch_leave_lazy_mmu_mode(), at least for now.
>
> index = PFN_DOWN(addr - data->start);
> page = data->pages[index];
> @@ -319,7 +319,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> }
> spin_unlock(&init_mm.page_table_lock);
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
Same.
>
> return 0;
> }
> @@ -471,7 +471,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> pte_t pte;
> int none;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_pause();
>
> spin_lock(&init_mm.page_table_lock);
> pte = ptep_get(ptep);
> @@ -483,7 +483,7 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> if (likely(!none))
> __free_page(pfn_to_page(pte_pfn(pte)));
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_resume();
>
> return 0;
> }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index b617b1be0f53..6bf7009fa5ce 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -453,7 +453,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (!start_pte)
> return 0;
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> nr = 1;
> ptent = ptep_get(pte);
> @@ -461,7 +461,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (++batch_count == SWAP_CLUSTER_MAX) {
> batch_count = 0;
> if (need_resched()) {
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> cond_resched();
> goto restart;
> @@ -497,7 +497,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (!folio_trylock(folio))
> continue;
> folio_get(folio);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> start_pte = NULL;
> err = split_folio(folio);
> @@ -508,7 +508,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> if (!start_pte)
> break;
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> if (!err)
> nr = 0;
> continue;
> @@ -556,7 +556,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> }
>
> if (start_pte) {
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> }
> if (pageout)
> @@ -675,7 +675,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (!start_pte)
> return 0;
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
> nr = 1;
> ptent = ptep_get(pte);
> @@ -724,7 +724,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (!folio_trylock(folio))
> continue;
> folio_get(folio);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> start_pte = NULL;
> err = split_folio(folio);
> @@ -735,7 +735,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (!start_pte)
> break;
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> if (!err)
> nr = 0;
> continue;
> @@ -775,7 +775,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (nr_swap)
> add_mm_counter(mm, MM_SWAPENTS, nr_swap);
> if (start_pte) {
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(start_pte, ptl);
> }
> cond_resched();
> diff --git a/mm/memory.c b/mm/memory.c
> index 6675e87eb7dd..c0c29a3b0bcc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1256,7 +1256,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> orig_src_pte = src_pte;
> orig_dst_pte = dst_pte;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> nr = 1;
> @@ -1325,7 +1325,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
> addr != end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(orig_src_pte, src_ptl);
> add_mm_rss_vec(dst_mm, rss);
> pte_unmap_unlock(orig_dst_pte, dst_ptl);
> @@ -1842,7 +1842,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> return addr;
>
> flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> do {
> bool any_skipped = false;
>
> @@ -1874,7 +1874,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
>
> add_mm_rss_vec(mm, rss);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
>
> /* Do the actual TLB flush before dropping ptl */
> if (force_flush) {
> @@ -2813,7 +2813,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> if (!pte)
> return -ENOMEM;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> do {
> BUG_ON(!pte_none(ptep_get(pte)));
> if (!pfn_modify_allowed(pfn, prot)) {
> @@ -2823,7 +2823,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
> set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
> pfn++;
> } while (pte++, addr += PAGE_SIZE, addr != end);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(mapped_pte, ptl);
> return err;
> }
> @@ -3174,7 +3174,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> return -EINVAL;
> }
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> if (fn) {
> do {
> @@ -3187,7 +3187,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> }
> *mask |= PGTBL_PTE_MODIFIED;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
>
> if (mm != &init_mm)
> pte_unmap_unlock(mapped_pte, ptl);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 23379663b1e1..0346c2d7819f 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -271,7 +271,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> ptep = pte_offset_map_lock(mm, pmdp, start, &ptl);
> if (!ptep)
> goto again;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> ptep += (addr - start) / PAGE_SIZE;
>
> for (; addr < end; addr += PAGE_SIZE, ptep++) {
> @@ -313,7 +313,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (folio_test_large(folio)) {
> int ret;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(ptep, ptl);
> ret = migrate_vma_split_folio(folio,
> migrate->fault_page);
> @@ -356,7 +356,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (folio && folio_test_large(folio)) {
> int ret;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(ptep, ptl);
> ret = migrate_vma_split_folio(folio,
> migrate->fault_page);
> @@ -485,7 +485,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (unmapped)
> flush_tlb_range(walk->vma, start, end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(ptep - 1, ptl);
>
> return 0;
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 283889e4f1ce..c0571445bef7 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -233,7 +233,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> is_private_single_threaded = vma_is_single_threaded_private(vma);
>
> flush_tlb_batched_pending(vma->vm_mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> do {
> nr_ptes = 1;
> oldpte = ptep_get(pte);
> @@ -379,7 +379,7 @@ static long change_pte_range(struct mmu_gather *tlb,
> }
> }
> } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end);
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(pte - 1, ptl);
>
> return pages;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 672264807db6..8275b9772ec1 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -260,7 +260,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> if (new_ptl != old_ptl)
> spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> flush_tlb_batched_pending(vma->vm_mm);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> @@ -305,7 +305,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> }
> }
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> if (force_flush)
> flush_tlb_range(vma, old_end - len, old_end);
> if (new_ptl != old_ptl)
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e6dfd5f28acd..b11f81095fa5 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1103,7 +1103,7 @@ static long move_present_ptes(struct mm_struct *mm,
> /* It's safe to drop the reference now as the page-table is holding one. */
> folio_put(*first_src_folio);
> *first_src_folio = NULL;
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> while (true) {
> orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1140,7 +1140,7 @@ static long move_present_ptes(struct mm_struct *mm,
> break;
> }
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> if (src_addr > src_start)
> flush_tlb_range(src_vma, src_start, src_addr);
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ecbac900c35f..1dea299fbb5a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -108,7 +108,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> if (!pte)
> return -ENOMEM;
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> if (unlikely(!pte_none(ptep_get(pte)))) {
> @@ -134,7 +134,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pfn++;
> } while (pte += PFN_DOWN(size), addr += size, addr != end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
> return 0;
> }
> @@ -366,7 +366,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> unsigned long size = PAGE_SIZE;
>
> pte = pte_offset_kernel(pmd, addr);
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> #ifdef CONFIG_HUGETLB_PAGE
> @@ -385,7 +385,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> WARN_ON(!pte_none(ptent) && !pte_present(ptent));
> } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
> }
>
> @@ -533,7 +533,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> if (!pte)
> return -ENOMEM;
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> struct page *page = pages[*nr];
> @@ -555,7 +555,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
> (*nr)++;
> } while (pte++, addr += PAGE_SIZE, addr != end);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> *mask |= PGTBL_PTE_MODIFIED;
>
> return err;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 92980b072121..564c97a9362f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3515,7 +3515,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> return false;
> }
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
> restart:
> for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
> unsigned long pfn;
> @@ -3556,7 +3556,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
> goto restart;
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> pte_unmap_unlock(pte, ptl);
>
> return suitable_to_scan(total, young);
> @@ -3597,7 +3597,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
> if (!spin_trylock(ptl))
> goto done;
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> do {
> unsigned long pfn;
> @@ -3644,7 +3644,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
>
> walk_update_folio(walk, last, gen, dirty);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
> spin_unlock(ptl);
> done:
> *first = -1;
> @@ -4243,7 +4243,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> }
> }
>
> - arch_enter_lazy_mmu_mode();
> + lazy_mmu_mode_enable();
>
> pte -= (addr - start) / PAGE_SIZE;
>
> @@ -4277,7 +4277,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>
> walk_update_folio(walk, last, gen, dirty);
>
> - arch_leave_lazy_mmu_mode();
> + lazy_mmu_mode_disable();
>
> /* feedback from rmap walkers to page table walkers */
> if (mm_state && suitable_to_scan(i, young))
LGTM, and also no apparent problem on the arm64 platform.
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
* Re: [PATCH v5 07/12] mm: bail out of lazy_mmu_mode_* in interrupt context
2025-11-24 13:22 ` [PATCH v5 07/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
2025-11-24 14:11 ` David Hildenbrand (Red Hat)
@ 2025-12-04 4:34 ` Anshuman Khandual
1 sibling, 0 replies; 40+ messages in thread
From: Anshuman Khandual @ 2025-12-04 4:34 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 24/11/25 6:52 PM, Kevin Brodsky wrote:
> The lazy MMU mode cannot be used in interrupt context. This is
> documented in <linux/pgtable.h>, but isn't consistently handled
> across architectures.
>
> arm64 ensures that calls to lazy_mmu_mode_* have no effect in
> interrupt context, because such calls do occur in certain
> configurations - see commit b81c688426a9 ("arm64/mm: Disable barrier
> batching in interrupt contexts"). Other architectures do not check
> this situation, most likely because it hasn't occurred so far.
>
> Let's handle this in the new generic lazy_mmu layer, in the same
> fashion as arm64: bail out of lazy_mmu_mode_* if in_interrupt().
> Also remove the arm64 handling that is now redundant.
Ensuring common behaviour across platforms - such as this bail-out in
interrupt context - can now be easily achieved, as there are generic
helpers that call the platform ones.
>
> Both arm64 and x86/Xen also ensure that any lazy MMU optimisation is
> disabled while in interrupt (see queue_pte_barriers() and
> xen_get_lazy_mode() respectively). This will be handled in the
> generic layer in a subsequent patch.
Makes sense.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 9 ---------
> include/linux/pgtable.h | 17 +++++++++++++++--
> 2 files changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 54f8d6bb6f22..e596899f4029 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -94,26 +94,17 @@ static inline void arch_enter_lazy_mmu_mode(void)
> * keeps tracking simple.
> */
>
> - if (in_interrupt())
> - return;
> -
> set_thread_flag(TIF_LAZY_MMU);
> }
>
> static inline void arch_flush_lazy_mmu_mode(void)
> {
> - if (in_interrupt())
> - return;
> -
> if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
> emit_pte_barriers();
> }
>
> static inline void arch_leave_lazy_mmu_mode(void)
> {
> - if (in_interrupt())
> - return;
> -
> arch_flush_lazy_mmu_mode();
> clear_thread_flag(TIF_LAZY_MMU);
> }
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index c121358dba15..8ff6fdb4b13d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,27 +228,40 @@ static inline int pmd_dirty(pmd_t pmd)
> * of the lazy mode. So the implementation must assume preemption may be enabled
> * and cpu migration is possible; it must take steps to be robust against this.
> * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode is disabled in
> + * interrupt context and calls to the lazy_mmu API have no effect.
> + * Nesting is not permitted.
Right - nesting is still not permitted.
> */
> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> static inline void lazy_mmu_mode_enable(void)
> {
> + if (in_interrupt())
> + return;
> +
> arch_enter_lazy_mmu_mode();
> }
>
> static inline void lazy_mmu_mode_disable(void)
> {
> + if (in_interrupt())
> + return;
> +
> arch_leave_lazy_mmu_mode();
> }
>
> static inline void lazy_mmu_mode_pause(void)
> {
> + if (in_interrupt())
> + return;
> +
> arch_leave_lazy_mmu_mode();
> }
>
> static inline void lazy_mmu_mode_resume(void)
> {
> + if (in_interrupt())
> + return;
> +
> arch_enter_lazy_mmu_mode();
> }
> #else
LGTM
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-12-03 8:20 ` Kevin Brodsky
@ 2025-12-04 5:25 ` Anshuman Khandual
2025-12-04 11:53 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 40+ messages in thread
From: Anshuman Khandual @ 2025-12-04 5:25 UTC (permalink / raw)
To: Kevin Brodsky, Alexander Gordeev
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 03/12/25 1:50 PM, Kevin Brodsky wrote:
> On 28/11/2025 14:55, Alexander Gordeev wrote:
>>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>>> + * currently enabled.
>> The in_lazy_mmu_mode() name looks ambiguous to me. When the lazy MMU mode
>> is paused are we still in lazy MMU mode? The __task_lazy_mmu_mode_active()
>> implementation suggests we are not, while one could still assume we are,
>> just paused.
>>
>> Should in_lazy_mmu_mode() be named e.g. as in_active_lazy_mmu_mode() such
>> a confusion would not occur in the first place.
>
> I see your point, how about is_lazy_mmu_mode_active()?
Agreed - is_lazy_mmu_mode_active() seems better.
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-11-24 13:22 ` [PATCH v5 08/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
` (2 preceding siblings ...)
2025-11-28 13:55 ` Alexander Gordeev
@ 2025-12-04 6:23 ` Anshuman Khandual
2025-12-04 11:52 ` David Hildenbrand (Red Hat)
2025-12-05 12:56 ` Kevin Brodsky
3 siblings, 2 replies; 40+ messages in thread
From: Anshuman Khandual @ 2025-12-04 6:23 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 24/11/25 6:52 PM, Kevin Brodsky wrote:
> Despite recent efforts to prevent lazy_mmu sections from nesting, it
> remains difficult to ensure that it never occurs - and in fact it
> does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC).
> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> made nesting tolerable on arm64, but without truly supporting it:
> the inner call to leave() disables the batching optimisation before
> the outer section ends.
>
> This patch actually enables lazy_mmu sections to nest by tracking
> the nesting level in task_struct, in a similar fashion to e.g.
> pagefault_{enable,disable}(). This is fully handled by the generic
> lazy_mmu helpers that were recently introduced.
>
> lazy_mmu sections were not initially intended to nest, so we need to
> clarify the semantics w.r.t. the arch_*_lazy_mmu_mode() callbacks.
> This patch takes the following approach:
>
> * The outermost calls to lazy_mmu_mode_{enable,disable}() trigger
> calls to arch_{enter,leave}_lazy_mmu_mode() - this is unchanged.
>
> * Nested calls to lazy_mmu_mode_{enable,disable}() are not forwarded
> to the arch via arch_{enter,leave} - lazy MMU remains enabled so
> the assumption is that these callbacks are not relevant. However,
> existing code may rely on a call to disable() to flush any batched
> state, regardless of nesting. arch_flush_lazy_mmu_mode() is
> therefore called in that situation.
>
> A separate interface was recently introduced to temporarily pause
> the lazy MMU mode: lazy_mmu_mode_{pause,resume}(). pause() fully
> exits the mode *regardless of the nesting level*, and resume()
> restores the mode at the same nesting level.
>
> pause()/resume() are themselves allowed to nest, so we actually
> store two nesting levels in task_struct: enable_count and
> pause_count. A new helper in_lazy_mmu_mode() is introduced to
> determine whether we are currently in lazy MMU mode; this will be
> used in subsequent patches to replace the various ways arch's
> currently track whether the mode is enabled.
>
> In summary (enable/pause represent the values *after* the call):
>
> lazy_mmu_mode_enable() -> arch_enter() enable=1 pause=0
> lazy_mmu_mode_enable() -> ø enable=2 pause=0
> lazy_mmu_mode_pause() -> arch_leave() enable=2 pause=1
> lazy_mmu_mode_resume() -> arch_enter() enable=2 pause=0
> lazy_mmu_mode_disable() -> arch_flush() enable=1 pause=0
> lazy_mmu_mode_disable() -> arch_leave() enable=0 pause=0
>
> Note: in_lazy_mmu_mode() is added to <linux/sched.h> to allow arch
> headers included by <linux/pgtable.h> to use it.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ----
> include/linux/mm_types_task.h | 5 ++
> include/linux/pgtable.h | 115 +++++++++++++++++++++++++++++--
> include/linux/sched.h | 45 ++++++++++++
> 4 files changed, 158 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index e596899f4029..a7d99dee3dc4 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -82,18 +82,6 @@ static inline void queue_pte_barriers(void)
>
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> - /*
> - * lazy_mmu_mode is not supposed to permit nesting. But in practice this
> - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
> - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
> - * permissions on the linear map with apply_to_page_range(), which
> - * re-enters lazy_mmu_mode. So we tolerate nesting in our
> - * implementation. The first call to arch_leave_lazy_mmu_mode() will
> - * flush and clear the flag such that the remainder of the work in the
> - * outer nest behaves as if outside of lazy mmu mode. This is safe and
> - * keeps tracking simple.
> - */
> -
> set_thread_flag(TIF_LAZY_MMU);
> }
Should not the platform-specific changes be deferred to subsequent patches,
until nesting is completely enabled in the generic code first? No problem as
such, but it would be a bit cleaner.
>
> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
> index a82aa80c0ba4..11bf319d78ec 100644
> --- a/include/linux/mm_types_task.h
> +++ b/include/linux/mm_types_task.h
> @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
> #endif
> };
>
> +struct lazy_mmu_state {
> + u8 enable_count;
> + u8 pause_count;
> +};
> +
Should not this be wrapped with CONFIG_ARCH_HAS_LAZY_MMU_MODE, as the task_struct
element 'lazy_mmu_state' is only available with the feature? Besides, is a depth
of 256 really expected here? Would 4 bits for each element, giving a depth of 16,
not be sufficient?
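For illustration, the kind of packing being suggested might look like the
below (hypothetical, for discussion only - not part of the patch):

	#include <linux/types.h>

	struct lazy_mmu_state {
		u8 enable_count : 4;	/* enable()/disable() depth, max 15 */
		u8 pause_count  : 4;	/* pause()/resume() depth, max 15 */
	};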
> #endif /* _LINUX_MM_TYPES_TASK_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 8ff6fdb4b13d..24fdb6f5c2e1 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -230,39 +230,140 @@ static inline int pmd_dirty(pmd_t pmd)
> * (In practice, for user PTE updates, the appropriate page table lock(s) are
> * held, but for kernel PTE updates, no lock is held). The mode is disabled in
> * interrupt context and calls to the lazy_mmu API have no effect.
> - * Nesting is not permitted.
> + *
> + * The lazy MMU mode is enabled for a given block of code using:
> + *
> + * lazy_mmu_mode_enable();
> + * <code>
> + * lazy_mmu_mode_disable();
> + *
> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
> + * A nested call to enable() has no functional effect; however disable() causes
> + * any batched architectural state to be flushed regardless of nesting. After a
Just wondering if there is a way for these generic helpers to ensure that the
platform really does the required flushing on _disable(), or whether the
expected platform semantics are described via this comment alone?
> + * call to disable(), the caller can therefore rely on all previous page table
> + * modifications to have taken effect, but the lazy MMU mode may still be
> + * enabled.
> + *
> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
> + * This can be done using:
> + *
> + * lazy_mmu_mode_pause();
> + * <code>
> + * lazy_mmu_mode_resume();
> + *
> + * pause() ensures that the mode is exited regardless of the nesting level;
> + * resume() re-enters the mode at the same nesting level. Any call to the
> + * lazy_mmu_mode_* API between those two calls has no effect. In particular,
> + * this means that pause()/resume() pairs may nest.
> + *
> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
> + * currently enabled.
Just wondering - could a corresponding test be included, probably via KUnit,
to ensure the semantics described above are being followed?
> */
> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +/**
> + * lazy_mmu_mode_enable() - Enable the lazy MMU mode.
> + *
> + * Enters a new lazy MMU mode section; if the mode was not already enabled,
> + * enables it and calls arch_enter_lazy_mmu_mode().
> + *
> + * Must be paired with a call to lazy_mmu_mode_disable().
> + *
> + * Has no effect if called:
> + * - While paused - see lazy_mmu_mode_pause()
> + * - In interrupt context
> + */
> static inline void lazy_mmu_mode_enable(void)
> {
> - if (in_interrupt())
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + if (in_interrupt() || state->pause_count > 0)
> return;
>
> - arch_enter_lazy_mmu_mode();
> + VM_WARN_ON_ONCE(state->enable_count == U8_MAX);
> +
> + if (state->enable_count++ == 0)
> + arch_enter_lazy_mmu_mode();
When lazy_mmu_mode_enable() gets called for the first time with state->enable_count
as 0, will arch_enter_lazy_mmu_mode() not get called? A bit confused.
> }
>
> +/**
> + * lazy_mmu_mode_disable() - Disable the lazy MMU mode.
> + *
> + * Exits the current lazy MMU mode section. If it is the outermost section,
> + * disables the mode and calls arch_leave_lazy_mmu_mode(). Otherwise (nested
> + * section), calls arch_flush_lazy_mmu_mode().
> + *
> + * Must match a call to lazy_mmu_mode_enable().
> + *
> + * Has no effect if called:
> + * - While paused - see lazy_mmu_mode_pause()
> + * - In interrupt context
> + */
> static inline void lazy_mmu_mode_disable(void)
> {
> - if (in_interrupt())
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> + if (in_interrupt() || state->pause_count > 0)
> return;
>
> - arch_leave_lazy_mmu_mode();
> + VM_WARN_ON_ONCE(state->enable_count == 0);
> +
> + if (--state->enable_count == 0)
> + arch_leave_lazy_mmu_mode();
> + else /* Exiting a nested section */
> + arch_flush_lazy_mmu_mode();
> +
> }
>
> +/**
> + * lazy_mmu_mode_pause() - Pause the lazy MMU mode.
> + *
> + * Pauses the lazy MMU mode; if it is currently active, disables it and calls
> + * arch_leave_lazy_mmu_mode().
> + *
> + * Must be paired with a call to lazy_mmu_mode_resume(). Calls to the
> + * lazy_mmu_mode_* API have no effect until the matching resume() call.
> + *
> + * Has no effect if called:
> + * - While paused (inside another pause()/resume() pair)
> + * - In interrupt context
> + */
> static inline void lazy_mmu_mode_pause(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> if (in_interrupt())
> return;
>
> - arch_leave_lazy_mmu_mode();
> + VM_WARN_ON_ONCE(state->pause_count == U8_MAX);
> +
> + if (state->pause_count++ == 0 && state->enable_count > 0)
> + arch_leave_lazy_mmu_mode();
> }
>
> +/**
> + * lazy_mmu_mode_resume() - Resume the lazy MMU mode.
> + *
> + * Resumes the lazy MMU mode; if it was active at the point where the matching
> + * call to lazy_mmu_mode_pause() was made, re-enables it and calls
> + * arch_enter_lazy_mmu_mode().
> + *
> + * Must match a call to lazy_mmu_mode_pause().
> + *
> + * Has no effect if called:
> + * - While paused (inside another pause()/resume() pair)
> + * - In interrupt context
> + */
> static inline void lazy_mmu_mode_resume(void)
> {
> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
> +
> if (in_interrupt())
> return;
>
> - arch_enter_lazy_mmu_mode();
> + VM_WARN_ON_ONCE(state->pause_count == 0);
> +
> + if (--state->pause_count == 0 && state->enable_count > 0)
> + arch_enter_lazy_mmu_mode();
> }
Should the state->pause_count/enable_count tests and increments/decrements not be
handled inside include/linux/sched.h via helpers like in_lazy_mmu_mode()? This
would ensure a cleaner abstraction with respect to task_struct.
> #else
> static inline void lazy_mmu_mode_enable(void) {}
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b469878de25c..847e242376db 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1441,6 +1441,10 @@ struct task_struct {
>
> struct page_frag task_frag;
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> + struct lazy_mmu_state lazy_mmu_state;
> +#endif
> +
> #ifdef CONFIG_TASK_DELAY_ACCT
> struct task_delay_info *delays;
> #endif
> @@ -1724,6 +1728,47 @@ static inline char task_state_to_char(struct task_struct *tsk)
> return task_index_to_char(task_state_index(tsk));
> }
>
> +#ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
> +/**
> + * __task_lazy_mmu_mode_active() - Test the lazy MMU mode state for a task.
> + * @tsk: The task to check.
> + *
> + * Test whether @tsk has its lazy MMU mode state set to active (i.e. enabled
> + * and not paused).
> + *
> + * This function only considers the state saved in task_struct; to test whether
> + * current actually is in lazy MMU mode, in_lazy_mmu_mode() should be used
> + * instead.
> + *
> + * This function is intended for architectures that implement the lazy MMU
> + * mode; it must not be called from generic code.
> + */
> +static inline bool __task_lazy_mmu_mode_active(struct task_struct *tsk)
> +{
> + struct lazy_mmu_state *state = &tsk->lazy_mmu_state;
> +
> + return state->enable_count > 0 && state->pause_count == 0;
> +}
> +
> +/**
> + * in_lazy_mmu_mode() - Test whether we are currently in lazy MMU mode.
> + *
> + * Test whether the current context is in lazy MMU mode. This is true if both:
> + * 1. We are not in interrupt context
> + * 2. Lazy MMU mode is active for the current task
> + *
> + * This function is intended for architectures that implement the lazy MMU
> + * mode; it must not be called from generic code.
> + */
> +static inline bool in_lazy_mmu_mode(void)
> +{
> + if (in_interrupt())
> + return false;
> +
> + return __task_lazy_mmu_mode_active(current);
> +}
> +#endif
> +
> extern struct pid *cad_pid;
>
> /*
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
2025-11-24 13:22 ` [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
2025-11-24 14:10 ` David Hildenbrand (Red Hat)
@ 2025-12-04 6:52 ` Anshuman Khandual
2025-12-04 11:39 ` David Hildenbrand (Red Hat)
1 sibling, 1 reply; 40+ messages in thread
From: Anshuman Khandual @ 2025-12-04 6:52 UTC (permalink / raw)
To: Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 24/11/25 6:52 PM, Kevin Brodsky wrote:
> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
> mode. As a result we no longer need a TIF flag for that purpose -
> let's use the new in_lazy_mmu_mode() helper instead.
>
> The explicit check for in_interrupt() is no longer necessary either
> as in_lazy_mmu_mode() always returns false in interrupt context.
>
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
> ---
> arch/arm64/include/asm/pgtable.h | 19 +++----------------
> arch/arm64/include/asm/thread_info.h | 3 +--
> 2 files changed, 4 insertions(+), 18 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index a7d99dee3dc4..dd7ed653a20d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -62,28 +62,16 @@ static inline void emit_pte_barriers(void)
>
> static inline void queue_pte_barriers(void)
> {
> - unsigned long flags;
> -
> - if (in_interrupt()) {
> - emit_pte_barriers();
> - return;
> - }
> -
> - flags = read_thread_flags();
> -
> - if (flags & BIT(TIF_LAZY_MMU)) {
> + if (in_lazy_mmu_mode()) {
> /* Avoid the atomic op if already set. */
> - if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
> + if (!test_thread_flag(TIF_LAZY_MMU_PENDING))
A small nit - would it be better not to use test_thread_flag() here and just
keep checking 'flags' as before, to avoid unrelated changes? Not a problem TBH.
> set_thread_flag(TIF_LAZY_MMU_PENDING);
> } else {
> emit_pte_barriers();
> }
> }
>
> -static inline void arch_enter_lazy_mmu_mode(void)
> -{
> - set_thread_flag(TIF_LAZY_MMU);
> -}
> +static inline void arch_enter_lazy_mmu_mode(void) {}
>
> static inline void arch_flush_lazy_mmu_mode(void)
> {
> @@ -94,7 +82,6 @@ static inline void arch_flush_lazy_mmu_mode(void)
> static inline void arch_leave_lazy_mmu_mode(void)
> {
> arch_flush_lazy_mmu_mode();
> - clear_thread_flag(TIF_LAZY_MMU);
> }
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h
> index f241b8601ebd..4ff8da0767d9 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -84,8 +84,7 @@ void arch_setup_new_exec(void);
> #define TIF_SME_VL_INHERIT 28 /* Inherit SME vl_onexec across exec */
> #define TIF_KERNEL_FPSTATE 29 /* Task is in a kernel mode FPSIMD section */
> #define TIF_TSC_SIGSEGV 30 /* SIGSEGV on counter-timer access */
> -#define TIF_LAZY_MMU 31 /* Task in lazy mmu mode */
> -#define TIF_LAZY_MMU_PENDING 32 /* Ops pending for lazy mmu mode exit */
> +#define TIF_LAZY_MMU_PENDING 31 /* Ops pending for lazy mmu mode exit */
>
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
LGTM.
Hence, with or without the 'flags' change in queue_pte_barriers() above:
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode()
2025-12-04 6:52 ` Anshuman Khandual
@ 2025-12-04 11:39 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-04 11:39 UTC (permalink / raw)
To: Anshuman Khandual, Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 12/4/25 07:52, Anshuman Khandual wrote:
> On 24/11/25 6:52 PM, Kevin Brodsky wrote:
>> The generic lazy_mmu layer now tracks whether a task is in lazy MMU
>> mode. As a result we no longer need a TIF flag for that purpose -
>> let's use the new in_lazy_mmu_mode() helper instead.
>>
>> The explicit check for in_interrupt() is no longer necessary either
>> as in_lazy_mmu_mode() always returns false in interrupt context.
>>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 19 +++----------------
>> arch/arm64/include/asm/thread_info.h | 3 +--
>> 2 files changed, 4 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index a7d99dee3dc4..dd7ed653a20d 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -62,28 +62,16 @@ static inline void emit_pte_barriers(void)
>>
>> static inline void queue_pte_barriers(void)
>> {
>> - unsigned long flags;
>> -
>> - if (in_interrupt()) {
>> - emit_pte_barriers();
>> - return;
>> - }
>> -
>> - flags = read_thread_flags();
>> -
>> - if (flags & BIT(TIF_LAZY_MMU)) {
>> + if (in_lazy_mmu_mode()) {
>> /* Avoid the atomic op if already set. */
>> - if (!(flags & BIT(TIF_LAZY_MMU_PENDING)))
>> + if (!test_thread_flag(TIF_LAZY_MMU_PENDING))
>
> A small nit - would it be better not to use test_thread_flag() here and just
> keep checking 'flags' as before, to avoid unrelated changes? Not a problem TBH.
I'd assume the existing code wanted to avoid fetching the flags two
times? So switching to test_thread_flag() should be fine now.
--
Cheers
David
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-12-04 6:23 ` Anshuman Khandual
@ 2025-12-04 11:52 ` David Hildenbrand (Red Hat)
2025-12-05 12:50 ` Kevin Brodsky
2025-12-05 12:56 ` Kevin Brodsky
1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-04 11:52 UTC (permalink / raw)
To: Anshuman Khandual, Kevin Brodsky, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
Some comments from my side:
>> static inline void arch_enter_lazy_mmu_mode(void)
>> {
>> - /*
>> - * lazy_mmu_mode is not supposed to permit nesting. But in practice this
>> - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
>> - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
>> - * permissions on the linear map with apply_to_page_range(), which
>> - * re-enters lazy_mmu_mode. So we tolerate nesting in our
>> - * implementation. The first call to arch_leave_lazy_mmu_mode() will
>> - * flush and clear the flag such that the remainder of the work in the
>> - * outer nest behaves as if outside of lazy mmu mode. This is safe and
>> - * keeps tracking simple.
>> - */
>> -
>> set_thread_flag(TIF_LAZY_MMU);
>> }
>
> Should platform-specific changes not be deferred to subsequent patches, until
> nesting is completely enabled in the generic layer first? Not a problem as such,
> but it would be a bit cleaner.
This could indeed be done in a separate patch. But I also don't see a
problem with updating the doc in this patch.
>
>>
>> diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
>> index a82aa80c0ba4..11bf319d78ec 100644
>> --- a/include/linux/mm_types_task.h
>> +++ b/include/linux/mm_types_task.h
>> @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
>> #endif
>> };
>>
>> +struct lazy_mmu_state {
>> + u8 enable_count;
>> + u8 pause_count;
>> +};
>> +
>
> Should this not be wrapped in CONFIG_ARCH_HAS_LAZY_MMU_MODE, as the task_struct
> element 'lazy_mmu_state' is only available with the feature?
No strong opinion; the compiler will ignore it either way. And less
ifdef is good, right? :)
... and there is nothing magical in there that would result in other
dependencies.
> Besides, is a depth of 256 really expected here? Wouldn't 4 bits per element
> be sufficient, giving a depth of 16?
We could indeed use something like
struct lazy_mmu_state {
u8 enable_count : 4;
u8 pause_count : 4;
};
but then, the individual operations on enable_count/pause_count need
more instructions.
Further, as discussed, this 1 additional byte barely matters given the
existing size of the task struct.
No strong opinion.
>
>> */
>> #ifdef CONFIG_ARCH_HAS_LAZY_MMU_MODE
>> +/**
>> + * lazy_mmu_mode_enable() - Enable the lazy MMU mode.
>> + *
>> + * Enters a new lazy MMU mode section; if the mode was not already enabled,
>> + * enables it and calls arch_enter_lazy_mmu_mode().
>> + *
>> + * Must be paired with a call to lazy_mmu_mode_disable().
>> + *
>> + * Has no effect if called:
>> + * - While paused - see lazy_mmu_mode_pause()
>> + * - In interrupt context
>> + */
>> static inline void lazy_mmu_mode_enable(void)
>> {
>> - if (in_interrupt())
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> + if (in_interrupt() || state->pause_count > 0)
>> return;
>>
>> - arch_enter_lazy_mmu_mode();
>> + VM_WARN_ON_ONCE(state->enable_count == U8_MAX);
>> +
>> + if (state->enable_count++ == 0)
>> + arch_enter_lazy_mmu_mode();
>
> When lazy_mmu_mode_enable() gets called for the first time with state->enable_count as 0,
> will arch_enter_lazy_mmu_mode() not get called? A bit confused.
state->enable_count++ returns the old value (0). Are you thinking of
++state->enable_count?
But maybe I misunderstood your concern.
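For illustration only (plain userspace C, not code from the series), the
post-increment behaviour can be checked in isolation:

	#include <stdio.h>

	static unsigned char enable_count;

	static void enter(void) { printf("arch_enter called\n"); }

	static void enable(void)
	{
		/* Post-increment: the *old* value is compared, so the very
		 * first call (enable_count 0 -> 1) does invoke enter(). */
		if (enable_count++ == 0)
			enter();
	}

	int main(void)
	{
		enable();	/* prints "arch_enter called" */
		enable();	/* nested: nothing printed */
		return 0;
	}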
[...]
>> +/**
>> + * lazy_mmu_mode_resume() - Resume the lazy MMU mode.
>> + *
>> + * Resumes the lazy MMU mode; if it was active at the point where the matching
>> + * call to lazy_mmu_mode_pause() was made, re-enables it and calls
>> + * arch_enter_lazy_mmu_mode().
>> + *
>> + * Must match a call to lazy_mmu_mode_pause().
>> + *
>> + * Has no effect if called:
>> + * - While paused (inside another pause()/resume() pair)
>> + * - In interrupt context
>> + */
>> static inline void lazy_mmu_mode_resume(void)
>> {
>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>> +
>> if (in_interrupt())
>> return;
>>
>> - arch_enter_lazy_mmu_mode();
>> + VM_WARN_ON_ONCE(state->pause_count == 0);
>> +
>> + if (--state->pause_count == 0 && state->enable_count > 0)
>> + arch_enter_lazy_mmu_mode();
>> }
>
> Should the state->pause_count/enable_count tests and increments/decrements not be
> handled inside include/linux/sched.h via helpers like in_lazy_mmu_mode()? This
> would ensure a cleaner abstraction with respect to task_struct.
I don't think this is required given that this code here implements
CONFIG_ARCH_HAS_LAZY_MMU_MODE support.
--
Cheers
David
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-12-04 5:25 ` Anshuman Khandual
@ 2025-12-04 11:53 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-04 11:53 UTC (permalink / raw)
To: Anshuman Khandual, Kevin Brodsky, Alexander Gordeev
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 12/4/25 06:25, Anshuman Khandual wrote:
>
>
> On 03/12/25 1:50 PM, Kevin Brodsky wrote:
>> On 28/11/2025 14:55, Alexander Gordeev wrote:
>>>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>>>> + * currently enabled.
>>> The in_lazy_mmu_mode() name looks ambiguous to me. When the lazy MMU mode
>>> is paused are we still in lazy MMU mode? The __task_lazy_mmu_mode_active()
>>> implementation suggests we are not, while one could still assume we are,
>>> just paused.
>>>
>>> If in_lazy_mmu_mode() were named e.g. in_active_lazy_mmu_mode(), such
>>> confusion would not occur in the first place.
>>
>> I see your point, how about is_lazy_mmu_mode_active()?
>
> Agreed - is_lazy_mmu_mode_active() seems better.
+1, I was scratching my head over this in previous revisions as well.
--
Cheers
David
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers
2025-12-04 4:17 ` Anshuman Khandual
@ 2025-12-05 12:47 ` Kevin Brodsky
0 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-12-05 12:47 UTC (permalink / raw)
To: Anshuman Khandual, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 04/12/2025 05:17, Anshuman Khandual wrote:
> On 24/11/25 6:52 PM, Kevin Brodsky wrote:
>> The implementation of the lazy MMU mode is currently entirely
>> arch-specific; core code directly calls arch helpers:
>> arch_{enter,leave}_lazy_mmu_mode().
>>
>> We are about to introduce support for nested lazy MMU sections.
>> As things stand we'd have to duplicate that logic in every arch
>> implementing lazy_mmu - adding to a fair amount of logic
>> already duplicated across lazy_mmu implementations.
>>
>> This patch therefore introduces a new generic layer that calls the
>> existing arch_* helpers. Two pair of calls are introduced:
>>
>> * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable()
>> This is the standard case where the mode is enabled for a given
>> block of code by surrounding it with enable() and disable()
>> calls.
>>
>> * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume()
>> This is for situations where the mode is temporarily disabled
>> by first calling pause() and then resume() (e.g. to prevent any
>> batching from occurring in a critical section).
>>
>> The documentation in <linux/pgtable.h> will be updated in a
>> subsequent patch.
>>
>> No functional change should be introduced at this stage.
>>
>> The implementation of enable()/resume() and disable()/pause() is
>> currently identical, but nesting support will change that.
>>
>> Most of the call sites have been updated using the following
>> Coccinelle script:
>>
>> @@
>> @@
>> {
>> ...
>> - arch_enter_lazy_mmu_mode();
>> + lazy_mmu_mode_enable();
>> ...
>> - arch_leave_lazy_mmu_mode();
>> + lazy_mmu_mode_disable();
>> ...
>> }
>>
>> @@
>> @@
>> {
>> ...
>> - arch_leave_lazy_mmu_mode();
>> + lazy_mmu_mode_pause();
>> ...
>> - arch_enter_lazy_mmu_mode();
>> + lazy_mmu_mode_resume();
>> ...
>> }
> At this point the arch_enter/leave_lazy_mmu_mode() helpers are still
> present on a given platform but are now being called from the new generic
> helpers lazy_mmu_mode_enable/disable(). Well, except on x86, where there
> are direct call sites for those old helpers.
Indeed, see notes below regarding x86. The direct calls to arch_flush()
are specific to x86 and there shouldn't be a need for a generic abstraction.
- Kevin
> arch/arm64/include/asm/pgtable.h:static inline void arch_enter_lazy_mmu_mode(void)
> arch/arm64/include/asm/pgtable.h:static inline void arch_leave_lazy_mmu_mode(void)
>
> arch/arm64/mm/mmu.c: lazy_mmu_mode_enable();
> arch/arm64/mm/pageattr.c: lazy_mmu_mode_enable();
>
> arch/arm64/mm/mmu.c: lazy_mmu_mode_disable();
> arch/arm64/mm/pageattr.c: lazy_mmu_mode_disable();
>
>> A couple of notes regarding x86:
>>
>> * Xen is currently the only case where explicit handling is required
>> for lazy MMU when context-switching. This is purely an
>> implementation detail and using the generic lazy_mmu_mode_*
>> functions would cause trouble when nesting support is introduced,
>> because the generic functions must be called from the current task.
>> For that reason we still use arch_leave() and arch_enter() there.
>>
>> * x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few
>> places, but only defines it if PARAVIRT_XXL is selected, and we
>> are removing the fallback in <linux/pgtable.h>. Add a new fallback
>> definition to <asm/pgtable.h> to keep things building.
>>
>> [...]
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-12-04 11:52 ` David Hildenbrand (Red Hat)
@ 2025-12-05 12:50 ` Kevin Brodsky
0 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-12-05 12:50 UTC (permalink / raw)
To: David Hildenbrand (Red Hat), Anshuman Khandual, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David S. Miller, David Woodhouse,
H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross,
Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan,
Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin,
Peter Zijlstra, Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 04/12/2025 12:52, David Hildenbrand (Red Hat) wrote:
> Some comments from my side:
>
>
>>> static inline void arch_enter_lazy_mmu_mode(void)
>>> {
>>> - /*
>>> - * lazy_mmu_mode is not supposed to permit nesting. But in
>>> practice this
>>> - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page
>>> allocation
>>> - * inside a lazy_mmu_mode section (such as zap_pte_range())
>>> will change
>>> - * permissions on the linear map with apply_to_page_range(), which
>>> - * re-enters lazy_mmu_mode. So we tolerate nesting in our
>>> - * implementation. The first call to arch_leave_lazy_mmu_mode()
>>> will
>>> - * flush and clear the flag such that the remainder of the work
>>> in the
>>> - * outer nest behaves as if outside of lazy mmu mode. This is
>>> safe and
>>> - * keeps tracking simple.
>>> - */
>>> -
>>> set_thread_flag(TIF_LAZY_MMU);
>>> }
>>
>> Should platform-specific changes not be deferred to subsequent patches,
>> until nesting is completely enabled in the generic layer first? Not a
>> problem as such, but it would be a bit cleaner.
>
> This could indeed be done in a separate patch. But I also don't see a
> problem with updating the doc in this patch.
I think it is consistent to remove that comment in this patch, since
nesting is fully supported from this patch onwards. Subsequent patches
are cleanups/optimisations that aren't functionally required.
Patch 7 takes the same approach: add handling in the generic layer,
remove anything now superfluous from arm64.
>
>>
>>> diff --git a/include/linux/mm_types_task.h
>>> b/include/linux/mm_types_task.h
>>> index a82aa80c0ba4..11bf319d78ec 100644
>>> --- a/include/linux/mm_types_task.h
>>> +++ b/include/linux/mm_types_task.h
>>> @@ -88,4 +88,9 @@ struct tlbflush_unmap_batch {
>>> #endif
>>> };
>>> +struct lazy_mmu_state {
>>> + u8 enable_count;
>>> + u8 pause_count;
>>> +};
>>> +
>>
>> Should this not be wrapped in CONFIG_ARCH_HAS_LAZY_MMU_MODE, as the
>> task_struct element 'lazy_mmu_state' is only available with the feature?
>
> No strong opinion; the compiler will ignore it either way. And less
> ifdef is good, right? :)
>
> ... and there is nothing magical in there that would result in other
> dependencies.
Agreed, #ifdef'ing types should only be done if necessary.
>
>> Besides, is a depth of 256 really expected here? Wouldn't 4 bits per
>> element be sufficient, giving a depth of 16?
>
>
> We could indeed use something like
>
> struct lazy_mmu_state {
> u8 enable_count : 4;
> u8 pause_count : 4;
> };
>
> but then, the individual operations on enable_count/pause_count need
> more instructions.
Indeed.
>
> Further, as discussed, this 1 additional byte barely matters given the
> existing size of the task struct.
In fact it would almost certainly make no difference (depending on
randomized_struct) since almost all members in task_struct have an
alignment of at least 2.
>
> [...]
>
>>> +/**
>>> + * lazy_mmu_mode_resume() - Resume the lazy MMU mode.
>>> + *
>>> + * Resumes the lazy MMU mode; if it was active at the point where
>>> the matching
>>> + * call to lazy_mmu_mode_pause() was made, re-enables it and calls
>>> + * arch_enter_lazy_mmu_mode().
>>> + *
>>> + * Must match a call to lazy_mmu_mode_pause().
>>> + *
>>> + * Has no effect if called:
>>> + * - While paused (inside another pause()/resume() pair)
>>> + * - In interrupt context
>>> + */
>>> static inline void lazy_mmu_mode_resume(void)
>>> {
>>> + struct lazy_mmu_state *state = &current->lazy_mmu_state;
>>> +
>>> if (in_interrupt())
>>> return;
>>> - arch_enter_lazy_mmu_mode();
>>> + VM_WARN_ON_ONCE(state->pause_count == 0);
>>> +
>>> + if (--state->pause_count == 0 && state->enable_count > 0)
>>> + arch_enter_lazy_mmu_mode();
>>> }
>>
>> Should the state->pause_count/enable_count tests and increments/decrements
>> not be handled inside include/linux/sched.h via helpers like
>> in_lazy_mmu_mode()? This would ensure a cleaner abstraction with respect
>> to task_struct.
>
> I don't think this is required given that this code here implements
> CONFIG_ARCH_HAS_LAZY_MMU_MODE support.
Agreed, in fact I'd rather not expose helpers that should only be used
in the lazy_mmu implementation itself.
- Kevin
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 08/12] mm: enable lazy_mmu sections to nest
2025-12-04 6:23 ` Anshuman Khandual
2025-12-04 11:52 ` David Hildenbrand (Red Hat)
@ 2025-12-05 12:56 ` Kevin Brodsky
1 sibling, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-12-05 12:56 UTC (permalink / raw)
To: Anshuman Khandual, linux-mm
Cc: linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Venkat Rao Bagalkote, Vlastimil Babka, Will Deacon, Yeoreum Yun,
linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel, x86
On 04/12/2025 07:23, Anshuman Khandual wrote:
>> [...]
>>
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 8ff6fdb4b13d..24fdb6f5c2e1 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -230,39 +230,140 @@ static inline int pmd_dirty(pmd_t pmd)
>> * (In practice, for user PTE updates, the appropriate page table lock(s) are
>> * held, but for kernel PTE updates, no lock is held). The mode is disabled in
>> * interrupt context and calls to the lazy_mmu API have no effect.
>> - * Nesting is not permitted.
>> + *
>> + * The lazy MMU mode is enabled for a given block of code using:
>> + *
>> + * lazy_mmu_mode_enable();
>> + * <code>
>> + * lazy_mmu_mode_disable();
>> + *
>> + * Nesting is permitted: <code> may itself use an enable()/disable() pair.
>> + * A nested call to enable() has no functional effect; however disable() causes
>> + * any batched architectural state to be flushed regardless of nesting. After a
> Just wondering whether there is a way for these generic helpers to ensure that the
> platform really does the required flushing on _disable(), or whether the expected
> platform semantics are described via this comment alone?
From the generic layer's perspective, flushing means calling
arch_flush_lazy_mmu_mode(). Like the other arch_*_lazy_mmu_mode helpers,
the actual semantics are unspecified - an arch could choose not to do
anything on flush if that's not required for page table changes to be
visible. There is actually an example of this in the kpkeys page table
hardening series [1] (this isn't doing any batching so there is nothing
to flush either).
[1]
https://lore.kernel.org/linux-hardening/20250815085512.2182322-19-kevin.brodsky@arm.com/
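To illustrate (hypothetical arch code, not taken from that series): an arch that
batches nothing could make all three hooks no-ops, since the generic layer only
requires that earlier page table updates be visible once disable() returns.

	/* Hypothetical <asm/pgtable.h> excerpt - nothing is batched, so
	 * there is nothing to flush on disable(). */
	static inline void arch_enter_lazy_mmu_mode(void) { }
	static inline void arch_leave_lazy_mmu_mode(void) { }
	static inline void arch_flush_lazy_mmu_mode(void) { }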
>> + * call to disable(), the caller can therefore rely on all previous page table
>> + * modifications to have taken effect, but the lazy MMU mode may still be
>> + * enabled.
>> + *
>> + * In certain cases, it may be desirable to temporarily pause the lazy MMU mode.
>> + * This can be done using:
>> + *
>> + * lazy_mmu_mode_pause();
>> + * <code>
>> + * lazy_mmu_mode_resume();
>> + *
>> + * pause() ensures that the mode is exited regardless of the nesting level;
>> + * resume() re-enters the mode at the same nesting level. Any call to the
>> + * lazy_mmu_mode_* API between those two calls has no effect. In particular,
>> + * this means that pause()/resume() pairs may nest.
>> + *
>> + * in_lazy_mmu_mode() can be used to check whether the lazy MMU mode is
>> + * currently enabled.
> Just wondering - could a corresponding test be included, probably via KUnit,
> to ensure the semantics described above are being followed?
Checking that is_lazy_mmu_mode_active() returns the right value at
different call depths should be doable, yes. I suppose that could live
in some file under mm/tests/ (doesn't exist yet but that's the preferred
approach for KUnit tests).
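For reference only, a rough sketch of what such a test might look like - the file
placement, names and the use of in_lazy_mmu_mode() (pending the
is_lazy_mmu_mode_active() rename) are illustrative, not part of the series, and a
real test would also need to satisfy per-arch requirements (e.g. disabling
preemption around the lazy_mmu section):

	#include <kunit/test.h>
	#include <linux/pgtable.h>
	#include <linux/sched.h>

	/* Requires CONFIG_ARCH_HAS_LAZY_MMU_MODE. */
	static void lazy_mmu_nesting_test(struct kunit *test)
	{
		KUNIT_EXPECT_FALSE(test, in_lazy_mmu_mode());

		lazy_mmu_mode_enable();
		KUNIT_EXPECT_TRUE(test, in_lazy_mmu_mode());

		lazy_mmu_mode_enable();		/* nested section */
		KUNIT_EXPECT_TRUE(test, in_lazy_mmu_mode());

		lazy_mmu_mode_pause();
		KUNIT_EXPECT_FALSE(test, in_lazy_mmu_mode());
		lazy_mmu_mode_resume();
		KUNIT_EXPECT_TRUE(test, in_lazy_mmu_mode());

		lazy_mmu_mode_disable();	/* exit nested section */
		KUNIT_EXPECT_TRUE(test, in_lazy_mmu_mode());

		lazy_mmu_mode_disable();	/* outermost section */
		KUNIT_EXPECT_FALSE(test, in_lazy_mmu_mode());
	}

	static struct kunit_case lazy_mmu_test_cases[] = {
		KUNIT_CASE(lazy_mmu_nesting_test),
		{}
	};

	static struct kunit_suite lazy_mmu_test_suite = {
		.name = "lazy_mmu",
		.test_cases = lazy_mmu_test_cases,
	};
	kunit_test_suite(lazy_mmu_test_suite);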
- Kevin
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v5 00/12] Nesting support for lazy MMU mode
2025-12-03 16:08 ` [PATCH v5 00/12] Nesting support for lazy MMU mode Venkat
@ 2025-12-05 13:00 ` Kevin Brodsky
0 siblings, 0 replies; 40+ messages in thread
From: Kevin Brodsky @ 2025-12-05 13:00 UTC (permalink / raw)
To: Venkat
Cc: linux-mm, LKML, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, David Woodhouse, H. Peter Anvin, Ingo Molnar,
Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra,
Ritesh Harjani (IBM),
Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner,
Vlastimil Babka, Will Deacon, Yeoreum Yun, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel, x86
On 03/12/2025 17:08, Venkat wrote:
> [...]
> Tested this patch series by applying it on top of mm-unstable, on both HASH and RADIX MMU, and all tests passed on both MMUs.
>
> Ran cache_shape, copyloops, mm from the Linux source, selftests/powerpc/, and memory-hotplug from selftests/. Also ran the below tests from the avocado-misc-tests repo.
>
> Link to repo: https://github.com/avocado-framework-tests/avocado-misc-tests
>
> avocado-misc-tests/memory/stutter.py
> avocado-misc-tests/memory/eatmemory.py
> avocado-misc-tests/memory/hugepage_sanity.py
> avocado-misc-tests/memory/fork_mem.py
> avocado-misc-tests/memory/memory_api.py
> avocado-misc-tests/memory/mprotect.py
> avocado-misc-tests/memory/vatest.py avocado-misc-tests/memory/vatest.py.data/vatest.yaml
> avocado-misc-tests/memory/transparent_hugepages.py
> avocado-misc-tests/memory/transparent_hugepages_swapping.py
> avocado-misc-tests/memory/transparent_hugepages_defrag.py
> avocado-misc-tests/memory/ksm_poison.py
>
> If it's good enough, please add the below tag for the PowerPC changes.
>
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Many thanks for the testing! Will add your tag to patches 1, 3 and 10.
- Kevin
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2025-12-05 13:00 UTC | newest]
Thread overview: 40+ messages
2025-11-24 13:22 [PATCH v5 00/12] Nesting support for lazy MMU mode Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 01/12] powerpc/64s: Do not re-activate batched TLB flush Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 02/12] x86/xen: simplify flush_lazy_mmu() Kevin Brodsky
2025-12-04 3:36 ` Anshuman Khandual
2025-11-24 13:22 ` [PATCH v5 03/12] powerpc/mm: implement arch_flush_lazy_mmu_mode() Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 04/12] sparc/mm: " Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 05/12] mm: introduce CONFIG_ARCH_HAS_LAZY_MMU_MODE Kevin Brodsky
2025-12-01 6:21 ` Anshuman Khandual
2025-12-03 8:19 ` Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 06/12] mm: introduce generic lazy_mmu helpers Kevin Brodsky
2025-11-28 13:50 ` Alexander Gordeev
2025-12-03 8:20 ` Kevin Brodsky
2025-12-04 4:17 ` Anshuman Khandual
2025-12-05 12:47 ` Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 07/12] mm: bail out of lazy_mmu_mode_* in interrupt context Kevin Brodsky
2025-11-24 14:11 ` David Hildenbrand (Red Hat)
2025-12-04 4:34 ` Anshuman Khandual
2025-11-24 13:22 ` [PATCH v5 08/12] mm: enable lazy_mmu sections to nest Kevin Brodsky
2025-11-24 14:09 ` David Hildenbrand (Red Hat)
2025-11-27 12:33 ` Alexander Gordeev
2025-11-27 12:45 ` Kevin Brodsky
2025-11-28 13:55 ` Alexander Gordeev
2025-12-03 8:20 ` Kevin Brodsky
2025-12-04 5:25 ` Anshuman Khandual
2025-12-04 11:53 ` David Hildenbrand (Red Hat)
2025-12-04 6:23 ` Anshuman Khandual
2025-12-04 11:52 ` David Hildenbrand (Red Hat)
2025-12-05 12:50 ` Kevin Brodsky
2025-12-05 12:56 ` Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 09/12] arm64: mm: replace TIF_LAZY_MMU with in_lazy_mmu_mode() Kevin Brodsky
2025-11-24 14:10 ` David Hildenbrand (Red Hat)
2025-12-04 6:52 ` Anshuman Khandual
2025-12-04 11:39 ` David Hildenbrand (Red Hat)
2025-11-24 13:22 ` [PATCH v5 10/12] powerpc/mm: replace batch->active " Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 11/12] sparc/mm: " Kevin Brodsky
2025-11-24 13:22 ` [PATCH v5 12/12] x86/xen: use lazy_mmu_state when context-switching Kevin Brodsky
2025-11-24 14:18 ` David Hildenbrand (Red Hat)
2025-11-25 13:39 ` Jürgen Groß
2025-12-03 16:08 ` [PATCH v5 00/12] Nesting support for lazy MMU mode Venkat
2025-12-05 13:00 ` Kevin Brodsky