* [PATCH 0/7] Nesting support for lazy MMU mode
@ 2025-09-04 12:57 Kevin Brodsky
2025-09-04 12:57 ` [PATCH 1/7] mm: remove arch_flush_lazy_mmu_mode() Kevin Brodsky
` (7 more replies)
0 siblings, 8 replies; 24+ messages in thread
From: Kevin Brodsky @ 2025-09-04 12:57 UTC (permalink / raw)
To: linux-mm
Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka,
Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel
When the lazy MMU mode was introduced eons ago, it wasn't made clear
whether such a sequence was legal:
arch_enter_lazy_mmu_mode()
...
arch_enter_lazy_mmu_mode()
...
arch_leave_lazy_mmu_mode()
...
arch_leave_lazy_mmu_mode()
It seems fair to say that nested calls to
arch_{enter,leave}_lazy_mmu_mode() were not expected, and most
architectures never explicitly supported nesting.
Ryan Roberts' series from March [1] attempted to prevent nesting from
ever occurring, and mostly succeeded. Unfortunately, a corner case
(DEBUG_PAGEALLOC) may still cause nesting to occur on arm64. Ryan
proposed [2] to address that corner case at the generic level, but this
approach received pushback; a subsequent attempt [3] tried to solve the
issue on arm64 only, but was deemed too fragile.
It is generally fragile to rely on lazy_mmu sections never nesting,
because callers of various standard mm functions cannot know whether
those functions use lazy_mmu themselves. This series therefore performs
a U-turn and adds support for nested lazy_mmu sections, on all
architectures.
The main change enabling nesting is patch 2, following the approach
suggested by Catalin Marinas [4]: have enter() return some state and
the matching leave() take that state. In this series, the state is only
used to handle nesting, but it could be used for other purposes such as
restoring context modified by enter(); the proposed kpkeys framework
would be an immediate user [5].
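Concretely, callers now store the value returned by enter() and pass it
back to the matching leave(). A minimal sketch of the resulting calling
convention (the pattern that patch 2 applies to existing lazy_mmu
sections via Coccinelle):
    lazy_mmu_state_t lazy_mmu_state;
    lazy_mmu_state = arch_enter_lazy_mmu_mode();
    ... /* batched PTE updates, e.g. set_ptes() */
    arch_leave_lazy_mmu_mode(lazy_mmu_state);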
Patch overview:
* Patch 1: general cleanup - not directly related, but avoids any doubt
regarding the expected behaviour of arch_flush_lazy_mmu_mode() outside
x86
* Patch 2: main API change, no functional change
* Patches 3-6: nesting support for all architectures that support lazy_mmu
* Patch 7: documentation update clarifying that nesting is supported
Patches 4-6 are technically not required at this stage, since nesting is
only observed on arm64, but they ensure future correctness in case
nesting is (re)introduced in generic paths. For instance, it could be
beneficial in some configurations to enter lazy_mmu in set_ptes() once
again.
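To show how the state can support nesting: an inner enter() reports that
lazy_mmu was already enabled, and the matching leave() then leaves the
mode untouched. The snippet below is only a simplified sketch along the
lines of the arm64 change in patch 3; the exact flag handling is assumed
here, not quoted from the patch:
    static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void)
    {
            if (in_interrupt())
                    return LAZY_MMU_DEFAULT;
            /* Flag already set: we are inside an outer lazy_mmu section. */
            if (test_and_set_thread_flag(TIF_LAZY_MMU))
                    return LAZY_MMU_NESTED;
            return LAZY_MMU_DEFAULT;
    }
    static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state)
    {
            if (in_interrupt())
                    return;
            if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING))
                    emit_pte_barriers();
            /* Only the outermost leave() exits lazy MMU mode. */
            if (state != LAZY_MMU_NESTED)
                    clear_thread_flag(TIF_LAZY_MMU);
    }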
This series has been tested by running the mm kselftests on arm64 with
DEBUG_PAGEALLOC and KFENCE. It was also build-tested on other
architectures (with and without XEN_PV on x86).
- Kevin
[1] https://lore.kernel.org/all/20250303141542.3371656-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/all/20250530140446.2387131-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/all/20250606135654.178300-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/all/aEhKSq0zVaUJkomX@arm.com/
[5] https://lore.kernel.org/linux-hardening/20250815085512.2182322-19-kevin.brodsky@arm.com/
---
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
---
Kevin Brodsky (7):
mm: remove arch_flush_lazy_mmu_mode()
mm: introduce local state for lazy_mmu sections
arm64: mm: fully support nested lazy_mmu sections
x86/xen: support nested lazy_mmu sections (again)
powerpc/mm: support nested lazy_mmu sections
sparc/mm: support nested lazy_mmu sections
mm: update lazy_mmu documentation
arch/arm64/include/asm/pgtable.h | 34 ++++++-------------
.../include/asm/book3s/64/tlbflush-hash.h | 24 +++++++++----
arch/powerpc/mm/book3s64/hash_tlb.c | 10 +++---
arch/powerpc/mm/book3s64/subpage_prot.c | 5 +--
arch/sparc/include/asm/tlbflush_64.h | 6 ++--
arch/sparc/mm/tlb.c | 19 ++++++++---
arch/x86/include/asm/paravirt.h | 8 ++---
arch/x86/include/asm/paravirt_types.h | 6 ++--
arch/x86/include/asm/pgtable.h | 3 +-
arch/x86/xen/enlighten_pv.c | 2 +-
arch/x86/xen/mmu_pv.c | 13 ++++---
fs/proc/task_mmu.c | 5 +--
include/linux/mm_types.h | 3 ++
include/linux/pgtable.h | 21 +++++++++---
mm/madvise.c | 20 ++++++-----
mm/memory.c | 20 ++++++-----
mm/migrate_device.c | 5 +--
mm/mprotect.c | 5 +--
mm/mremap.c | 5 +--
mm/vmalloc.c | 15 ++++----
mm/vmscan.c | 15 ++++----
21 files changed, 147 insertions(+), 97 deletions(-)
base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
--
2.47.0
^ permalink raw reply [flat|nested] 24+ messages in thread* [PATCH 1/7] mm: remove arch_flush_lazy_mmu_mode() 2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky @ 2025-09-04 12:57 ` Kevin Brodsky 2025-09-05 11:00 ` Mike Rapoport 2025-09-04 12:57 ` [PATCH 2/7] mm: introduce local state for lazy_mmu sections Kevin Brodsky ` (6 subsequent siblings) 7 siblings, 1 reply; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 12:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel This function has only ever been used in arch/x86, so there is no need for other architectures to implement it. Remove it from linux/pgtable.h and all architectures besides x86. The arm64 implementation is not empty but it is only called from arch_leave_lazy_mmu_mode(), so we can simply fold it there. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> --- arch/arm64/include/asm/pgtable.h | 9 +-------- arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 -- arch/sparc/include/asm/tlbflush_64.h | 1 - arch/x86/include/asm/pgtable.h | 3 ++- include/linux/pgtable.h | 1 - 5 files changed, 3 insertions(+), 13 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index abd2dee416b3..728d7b6ed20a 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -101,21 +101,14 @@ static inline void arch_enter_lazy_mmu_mode(void) set_thread_flag(TIF_LAZY_MMU); } -static inline void arch_flush_lazy_mmu_mode(void) +static inline void arch_leave_lazy_mmu_mode(void) { if (in_interrupt()) return; if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING)) emit_pte_barriers(); -} - -static inline void arch_leave_lazy_mmu_mode(void) -{ - if (in_interrupt()) - return; - arch_flush_lazy_mmu_mode(); clear_thread_flag(TIF_LAZY_MMU); } diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h index 146287d9580f..176d7fd79eeb 100644 --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h @@ -55,8 +55,6 @@ static inline void arch_leave_lazy_mmu_mode(void) preempt_enable(); } -#define arch_flush_lazy_mmu_mode() do {} while (0) - extern void hash__tlbiel_all(unsigned int action); extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h index 8b8cdaa69272..cd144eb31bdd 100644 --- a/arch/sparc/include/asm/tlbflush_64.h +++ b/arch/sparc/include/asm/tlbflush_64.h @@ -44,7 +44,6 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end); void flush_tlb_pending(void); void arch_enter_lazy_mmu_mode(void); void arch_leave_lazy_mmu_mode(void); -#define arch_flush_lazy_mmu_mode() do {} while (0) /* Local cpu only. 
*/ void __flush_tlb_all(void); diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index e33df3da6980..14fd672bc9b2 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -117,7 +117,8 @@ extern pmdval_t early_pmd_flags; #define pte_val(x) native_pte_val(x) #define __pte(x) native_make_pte(x) -#define arch_end_context_switch(prev) do {} while(0) +#define arch_end_context_switch(prev) do {} while (0) +#define arch_flush_lazy_mmu_mode() do {} while (0) #endif /* CONFIG_PARAVIRT_XXL */ static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 4c035637eeb7..8848e132a6be 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -234,7 +234,6 @@ static inline int pmd_dirty(pmd_t pmd) #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE #define arch_enter_lazy_mmu_mode() do {} while (0) #define arch_leave_lazy_mmu_mode() do {} while (0) -#define arch_flush_lazy_mmu_mode() do {} while (0) #endif #ifndef pte_batch_hint -- 2.47.0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 1/7] mm: remove arch_flush_lazy_mmu_mode() 2025-09-04 12:57 ` [PATCH 1/7] mm: remove arch_flush_lazy_mmu_mode() Kevin Brodsky @ 2025-09-05 11:00 ` Mike Rapoport 0 siblings, 0 replies; 24+ messages in thread From: Mike Rapoport @ 2025-09-05 11:00 UTC (permalink / raw) To: Kevin Brodsky Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On Thu, Sep 04, 2025 at 01:57:30PM +0100, Kevin Brodsky wrote: > This function has only ever been used in arch/x86, so there is no > need for other architectures to implement it. Remove it from > linux/pgtable.h and all architectures besides x86. > > The arm64 implementation is not empty but it is only called from > arch_leave_lazy_mmu_mode(), so we can simply fold it there. > > Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> > --- > arch/arm64/include/asm/pgtable.h | 9 +-------- > arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 2 -- > arch/sparc/include/asm/tlbflush_64.h | 1 - > arch/x86/include/asm/pgtable.h | 3 ++- > include/linux/pgtable.h | 1 - > 5 files changed, 3 insertions(+), 13 deletions(-) > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index abd2dee416b3..728d7b6ed20a 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -101,21 +101,14 @@ static inline void arch_enter_lazy_mmu_mode(void) > set_thread_flag(TIF_LAZY_MMU); > } > > -static inline void arch_flush_lazy_mmu_mode(void) > +static inline void arch_leave_lazy_mmu_mode(void) > { > if (in_interrupt()) > return; > > if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING)) > emit_pte_barriers(); > -} > - > -static inline void arch_leave_lazy_mmu_mode(void) > -{ > - if (in_interrupt()) > - return; > > - arch_flush_lazy_mmu_mode(); > clear_thread_flag(TIF_LAZY_MMU); > } > > diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > index 146287d9580f..176d7fd79eeb 100644 > --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > @@ -55,8 +55,6 @@ static inline void arch_leave_lazy_mmu_mode(void) > preempt_enable(); > } > > -#define arch_flush_lazy_mmu_mode() do {} while (0) > - > extern void hash__tlbiel_all(unsigned int action); > > extern void flush_hash_page(unsigned long vpn, real_pte_t pte, int psize, > diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h > index 8b8cdaa69272..cd144eb31bdd 100644 > --- a/arch/sparc/include/asm/tlbflush_64.h > +++ b/arch/sparc/include/asm/tlbflush_64.h > @@ -44,7 +44,6 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end); > void flush_tlb_pending(void); > void arch_enter_lazy_mmu_mode(void); > void arch_leave_lazy_mmu_mode(void); > -#define arch_flush_lazy_mmu_mode() do {} while (0) > > /* Local cpu only. 
*/ > void __flush_tlb_all(void); > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h > index e33df3da6980..14fd672bc9b2 100644 > --- a/arch/x86/include/asm/pgtable.h > +++ b/arch/x86/include/asm/pgtable.h > @@ -117,7 +117,8 @@ extern pmdval_t early_pmd_flags; > #define pte_val(x) native_pte_val(x) > #define __pte(x) native_make_pte(x) > > -#define arch_end_context_switch(prev) do {} while(0) > +#define arch_end_context_switch(prev) do {} while (0) > +#define arch_flush_lazy_mmu_mode() do {} while (0) > #endif /* CONFIG_PARAVIRT_XXL */ > > static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set) > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 4c035637eeb7..8848e132a6be 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -234,7 +234,6 @@ static inline int pmd_dirty(pmd_t pmd) > #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE > #define arch_enter_lazy_mmu_mode() do {} while (0) > #define arch_leave_lazy_mmu_mode() do {} while (0) > -#define arch_flush_lazy_mmu_mode() do {} while (0) > #endif > > #ifndef pte_batch_hint > -- > 2.47.0 > -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky 2025-09-04 12:57 ` [PATCH 1/7] mm: remove arch_flush_lazy_mmu_mode() Kevin Brodsky @ 2025-09-04 12:57 ` Kevin Brodsky 2025-09-04 15:06 ` Yeoreum Yun ` (2 more replies) 2025-09-04 12:57 ` [PATCH 3/7] arm64: mm: fully support nested " Kevin Brodsky ` (5 subsequent siblings) 7 siblings, 3 replies; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 12:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel arch_{enter,leave}_lazy_mmu_mode() currently have a stateless API (taking and returning no value). This is proving problematic in situations where leave() needs to restore some context back to its original state (before enter() was called). In particular, this makes it difficult to support the nesting of lazy_mmu sections - leave() does not know whether the matching enter() call occurred while lazy_mmu was already enabled, and whether to disable it or not. This patch gives all architectures the chance to store local state while inside a lazy_mmu section by making enter() return some value, storing it in a local variable, and having leave() take that value. That value is typed lazy_mmu_state_t - each architecture defining __HAVE_ARCH_ENTER_LAZY_MMU_MODE is free to define it as it sees fit. For now we define it as int everywhere, which is sufficient to support nesting. The diff is unfortunately rather large as all the API changes need to be done atomically. Main parts: * Changing the prototypes of arch_{enter,leave}_lazy_mmu_mode() in generic and arch code, and introducing lazy_mmu_state_t. * Introducing LAZY_MMU_{DEFAULT,NESTED} for future support of nesting. enter() always returns LAZY_MMU_DEFAULT for now. (linux/mm_types.h is not the most natural location for defining those constants, but there is no other obvious header that is accessible where arch's implement the helpers.) * Changing all lazy_mmu sections to introduce a lazy_mmu_state local variable, having enter() set it and leave() take it. Most of these changes were generated using the Coccinelle script below. @@ @@ { + lazy_mmu_state_t lazy_mmu_state; ... - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); ... - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); ... } Note: it is difficult to provide a default definition of lazy_mmu_state_t for architectures implementing lazy_mmu, because that definition would need to be available in arch/x86/include/asm/paravirt_types.h and adding a new generic #include there is very tricky due to the existing header soup. 
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> --- arch/arm64/include/asm/pgtable.h | 10 +++++++--- .../include/asm/book3s/64/tlbflush-hash.h | 9 ++++++--- arch/powerpc/mm/book3s64/hash_tlb.c | 10 ++++++---- arch/powerpc/mm/book3s64/subpage_prot.c | 5 +++-- arch/sparc/include/asm/tlbflush_64.h | 5 +++-- arch/sparc/mm/tlb.c | 6 ++++-- arch/x86/include/asm/paravirt.h | 6 ++++-- arch/x86/include/asm/paravirt_types.h | 2 ++ arch/x86/xen/enlighten_pv.c | 2 +- arch/x86/xen/mmu_pv.c | 2 +- fs/proc/task_mmu.c | 5 +++-- include/linux/mm_types.h | 3 +++ include/linux/pgtable.h | 6 ++++-- mm/madvise.c | 20 ++++++++++--------- mm/memory.c | 20 +++++++++++-------- mm/migrate_device.c | 5 +++-- mm/mprotect.c | 5 +++-- mm/mremap.c | 5 +++-- mm/vmalloc.c | 15 ++++++++------ mm/vmscan.c | 15 ++++++++------ 20 files changed, 97 insertions(+), 59 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 728d7b6ed20a..816197d08165 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -81,7 +81,9 @@ static inline void queue_pte_barriers(void) } #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE -static inline void arch_enter_lazy_mmu_mode(void) +typedef int lazy_mmu_state_t; + +static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) { /* * lazy_mmu_mode is not supposed to permit nesting. But in practice this @@ -96,12 +98,14 @@ static inline void arch_enter_lazy_mmu_mode(void) */ if (in_interrupt()) - return; + return LAZY_MMU_DEFAULT; set_thread_flag(TIF_LAZY_MMU); + + return LAZY_MMU_DEFAULT; } -static inline void arch_leave_lazy_mmu_mode(void) +static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) { if (in_interrupt()) return; diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h index 176d7fd79eeb..c9f1e819e567 100644 --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h @@ -25,13 +25,14 @@ DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch); extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch); #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE +typedef int lazy_mmu_state_t; -static inline void arch_enter_lazy_mmu_mode(void) +static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) { struct ppc64_tlb_batch *batch; if (radix_enabled()) - return; + return LAZY_MMU_DEFAULT; /* * apply_to_page_range can call us this preempt enabled when * operating on kernel page tables. @@ -39,9 +40,11 @@ static inline void arch_enter_lazy_mmu_mode(void) preempt_disable(); batch = this_cpu_ptr(&ppc64_tlb_batch); batch->active = 1; + + return LAZY_MMU_DEFAULT; } -static inline void arch_leave_lazy_mmu_mode(void) +static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) { struct ppc64_tlb_batch *batch; diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c index 21fcad97ae80..ee664f88e679 100644 --- a/arch/powerpc/mm/book3s64/hash_tlb.c +++ b/arch/powerpc/mm/book3s64/hash_tlb.c @@ -189,6 +189,7 @@ void hash__tlb_flush(struct mmu_gather *tlb) */ void __flush_hash_table_range(unsigned long start, unsigned long end) { + lazy_mmu_state_t lazy_mmu_state; int hugepage_shift; unsigned long flags; @@ -205,7 +206,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end) * way to do things but is fine for our needs here. 
*/ local_irq_save(flags); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); for (; start < end; start += PAGE_SIZE) { pte_t *ptep = find_init_mm_pte(start, &hugepage_shift); unsigned long pte; @@ -217,12 +218,13 @@ void __flush_hash_table_range(unsigned long start, unsigned long end) continue; hpte_need_flush(&init_mm, start, ptep, pte, hugepage_shift); } - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); local_irq_restore(flags); } void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) { + lazy_mmu_state_t lazy_mmu_state; pte_t *pte; pte_t *start_pte; unsigned long flags; @@ -237,7 +239,7 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long * way to do things but is fine for our needs here. */ local_irq_save(flags); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); start_pte = pte_offset_map(pmd, addr); if (!start_pte) goto out; @@ -249,6 +251,6 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long } pte_unmap(start_pte); out: - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); local_irq_restore(flags); } diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c index ec98e526167e..4720f9f321af 100644 --- a/arch/powerpc/mm/book3s64/subpage_prot.c +++ b/arch/powerpc/mm/book3s64/subpage_prot.c @@ -53,6 +53,7 @@ void subpage_prot_free(struct mm_struct *mm) static void hpte_flush_range(struct mm_struct *mm, unsigned long addr, int npages) { + lazy_mmu_state_t lazy_mmu_state; pgd_t *pgd; p4d_t *p4d; pud_t *pud; @@ -73,13 +74,13 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr, pte = pte_offset_map_lock(mm, pmd, addr, &ptl); if (!pte) return; - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); for (; npages > 0; --npages) { pte_update(mm, addr, pte, 0, 0, 0); addr += PAGE_SIZE; ++pte; } - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(pte - 1, ptl); } diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h index cd144eb31bdd..02c93a4e6af5 100644 --- a/arch/sparc/include/asm/tlbflush_64.h +++ b/arch/sparc/include/asm/tlbflush_64.h @@ -40,10 +40,11 @@ static inline void flush_tlb_range(struct vm_area_struct *vma, void flush_tlb_kernel_range(unsigned long start, unsigned long end); #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE +typedef int lazy_mmu_state_t; void flush_tlb_pending(void); -void arch_enter_lazy_mmu_mode(void); -void arch_leave_lazy_mmu_mode(void); +lazy_mmu_state_t arch_enter_lazy_mmu_mode(void); +void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state); /* Local cpu only. 
*/ void __flush_tlb_all(void); diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c index a35ddcca5e76..bf5094b770af 100644 --- a/arch/sparc/mm/tlb.c +++ b/arch/sparc/mm/tlb.c @@ -50,16 +50,18 @@ void flush_tlb_pending(void) put_cpu_var(tlb_batch); } -void arch_enter_lazy_mmu_mode(void) +lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) { struct tlb_batch *tb; preempt_disable(); tb = this_cpu_ptr(&tlb_batch); tb->active = 1; + + return LAZY_MMU_DEFAULT; } -void arch_leave_lazy_mmu_mode(void) +void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) { struct tlb_batch *tb = this_cpu_ptr(&tlb_batch); diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index b5e59a7ba0d0..65a0d394fba1 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -527,12 +527,14 @@ static inline void arch_end_context_switch(struct task_struct *next) } #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE -static inline void arch_enter_lazy_mmu_mode(void) +static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) { PVOP_VCALL0(mmu.lazy_mode.enter); + + return LAZY_MMU_DEFAULT; } -static inline void arch_leave_lazy_mmu_mode(void) +static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) { PVOP_VCALL0(mmu.lazy_mode.leave); } diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 37a8627d8277..bc1af86868a3 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -41,6 +41,8 @@ struct pv_info { }; #ifdef CONFIG_PARAVIRT_XXL +typedef int lazy_mmu_state_t; + struct pv_lazy_ops { /* Set deferred update mode, used for batching operations. */ void (*enter)(void); diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c index 26bbaf4b7330..a245ba47a631 100644 --- a/arch/x86/xen/enlighten_pv.c +++ b/arch/x86/xen/enlighten_pv.c @@ -426,7 +426,7 @@ static void xen_start_context_switch(struct task_struct *prev) BUG_ON(preemptible()); if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) { - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(LAZY_MMU_DEFAULT); set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU_UPDATES); } enter_lazy(XEN_LAZY_CPU); diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 2a4a8deaf612..2039d5132ca3 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -2140,7 +2140,7 @@ static void xen_flush_lazy_mmu(void) preempt_disable(); if (xen_get_lazy_mode() == XEN_LAZY_MMU) { - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(LAZY_MMU_DEFAULT); arch_enter_lazy_mmu_mode(); } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 29cca0e6d0ff..c9bf1128a4cd 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -2610,6 +2610,7 @@ static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start, static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, unsigned long end, struct mm_walk *walk) { + lazy_mmu_state_t lazy_mmu_state; struct pagemap_scan_private *p = walk->private; struct vm_area_struct *vma = walk->vma; unsigned long addr, flush_end = 0; @@ -2628,7 +2629,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, return 0; } - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) { /* Fast path for performing exclusive WP */ @@ -2698,7 +2699,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, if (flush_end) flush_tlb_range(vma, start, addr); - arch_leave_lazy_mmu_mode(); + 
arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(start_pte, ptl); cond_resched(); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 08bc2442db93..18745c32f2c0 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1441,6 +1441,9 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm); extern void tlb_finish_mmu(struct mmu_gather *tlb); +#define LAZY_MMU_DEFAULT 0 +#define LAZY_MMU_NESTED 1 + struct vm_fault; /** diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 8848e132a6be..6932c8e344ab 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -232,8 +232,10 @@ static inline int pmd_dirty(pmd_t pmd) * and the mode cannot be used in interrupt context. */ #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE -#define arch_enter_lazy_mmu_mode() do {} while (0) -#define arch_leave_lazy_mmu_mode() do {} while (0) +typedef int lazy_mmu_state_t; + +#define arch_enter_lazy_mmu_mode() (LAZY_MMU_DEFAULT) +#define arch_leave_lazy_mmu_mode(state) ((void)(state)) #endif #ifndef pte_batch_hint diff --git a/mm/madvise.c b/mm/madvise.c index 35ed4ab0d7c5..72c032f2cf56 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -357,6 +357,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk) { + lazy_mmu_state_t lazy_mmu_state; struct madvise_walk_private *private = walk->private; struct mmu_gather *tlb = private->tlb; bool pageout = private->pageout; @@ -455,7 +456,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (!start_pte) return 0; flush_tlb_batched_pending(mm); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) { nr = 1; ptent = ptep_get(pte); @@ -463,7 +464,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (++batch_count == SWAP_CLUSTER_MAX) { batch_count = 0; if (need_resched()) { - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(start_pte, ptl); cond_resched(); goto restart; @@ -499,7 +500,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (!folio_trylock(folio)) continue; folio_get(folio); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(start_pte, ptl); start_pte = NULL; err = split_folio(folio); @@ -510,7 +511,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, if (!start_pte) break; flush_tlb_batched_pending(mm); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); if (!err) nr = 0; continue; @@ -558,7 +559,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, } if (start_pte) { - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(start_pte, ptl); } if (pageout) @@ -657,6 +658,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, { const cydp_t cydp_flags = CYDP_CLEAR_YOUNG | CYDP_CLEAR_DIRTY; + lazy_mmu_state_t lazy_mmu_state; struct mmu_gather *tlb = walk->private; struct mm_struct *mm = tlb->mm; struct vm_area_struct *vma = walk->vma; @@ -677,7 +679,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (!start_pte) return 0; flush_tlb_batched_pending(mm); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { nr = 1; ptent = ptep_get(pte); @@ -727,7 +729,7 @@ static int 
madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (!folio_trylock(folio)) continue; folio_get(folio); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(start_pte, ptl); start_pte = NULL; err = split_folio(folio); @@ -738,7 +740,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (!start_pte) break; flush_tlb_batched_pending(mm); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); if (!err) nr = 0; continue; @@ -778,7 +780,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, if (nr_swap) add_mm_counter(mm, MM_SWAPENTS, nr_swap); if (start_pte) { - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(start_pte, ptl); } cond_resched(); diff --git a/mm/memory.c b/mm/memory.c index 0ba4f6b71847..ebe0ffddcb77 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1079,6 +1079,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, unsigned long end) { + lazy_mmu_state_t lazy_mmu_state; struct mm_struct *dst_mm = dst_vma->vm_mm; struct mm_struct *src_mm = src_vma->vm_mm; pte_t *orig_src_pte, *orig_dst_pte; @@ -1126,7 +1127,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); orig_src_pte = src_pte; orig_dst_pte = dst_pte; - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); do { nr = 1; @@ -1195,7 +1196,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr, addr != end); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(orig_src_pte, src_ptl); add_mm_rss_vec(dst_mm, rss); pte_unmap_unlock(orig_dst_pte, dst_ptl); @@ -1694,6 +1695,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end, struct zap_details *details) { + lazy_mmu_state_t lazy_mmu_state; bool force_flush = false, force_break = false; struct mm_struct *mm = tlb->mm; int rss[NR_MM_COUNTERS]; @@ -1714,7 +1716,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, return addr; flush_tlb_batched_pending(mm); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); do { bool any_skipped = false; @@ -1746,7 +1748,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval); add_mm_rss_vec(mm, rss); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); /* Do the actual TLB flush before dropping ptl */ if (force_flush) { @@ -2683,6 +2685,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, unsigned long end, unsigned long pfn, pgprot_t prot) { + lazy_mmu_state_t lazy_mmu_state; pte_t *pte, *mapped_pte; spinlock_t *ptl; int err = 0; @@ -2690,7 +2693,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl); if (!pte) return -ENOMEM; - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); do { BUG_ON(!pte_none(ptep_get(pte))); if (!pfn_modify_allowed(pfn, prot)) { @@ -2700,7 +2703,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot))); pfn++; } while (pte++, addr += PAGE_SIZE, addr != end); - arch_leave_lazy_mmu_mode(); + 
arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(mapped_pte, ptl); return err; } @@ -2989,6 +2992,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, pte_fn_t fn, void *data, bool create, pgtbl_mod_mask *mask) { + lazy_mmu_state_t lazy_mmu_state; pte_t *pte, *mapped_pte; int err = 0; spinlock_t *ptl; @@ -3007,7 +3011,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, return -EINVAL; } - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); if (fn) { do { @@ -3020,7 +3024,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, } *mask |= PGTBL_PTE_MODIFIED; - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); if (mm != &init_mm) pte_unmap_unlock(mapped_pte, ptl); diff --git a/mm/migrate_device.c b/mm/migrate_device.c index e05e14d6eacd..659285c6ba77 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -59,6 +59,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, unsigned long end, struct mm_walk *walk) { + lazy_mmu_state_t lazy_mmu_state; struct migrate_vma *migrate = walk->private; struct folio *fault_folio = migrate->fault_page ? page_folio(migrate->fault_page) : NULL; @@ -110,7 +111,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); if (!ptep) goto again; - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); for (; addr < end; addr += PAGE_SIZE, ptep++) { struct dev_pagemap *pgmap; @@ -287,7 +288,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (unmapped) flush_tlb_range(walk->vma, start, end); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(ptep - 1, ptl); return 0; diff --git a/mm/mprotect.c b/mm/mprotect.c index 113b48985834..7bba651e5aa3 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -273,6 +273,7 @@ static long change_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) { + lazy_mmu_state_t lazy_mmu_state; pte_t *pte, oldpte; spinlock_t *ptl; long pages = 0; @@ -293,7 +294,7 @@ static long change_pte_range(struct mmu_gather *tlb, target_node = numa_node_id(); flush_tlb_batched_pending(vma->vm_mm); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); do { nr_ptes = 1; oldpte = ptep_get(pte); @@ -439,7 +440,7 @@ static long change_pte_range(struct mmu_gather *tlb, } } } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(pte - 1, ptl); return pages; diff --git a/mm/mremap.c b/mm/mremap.c index e618a706aff5..dac29a734e16 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -193,6 +193,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr static int move_ptes(struct pagetable_move_control *pmc, unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd) { + lazy_mmu_state_t lazy_mmu_state; struct vm_area_struct *vma = pmc->old; bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma); struct mm_struct *mm = vma->vm_mm; @@ -256,7 +257,7 @@ static int move_ptes(struct pagetable_move_control *pmc, if (new_ptl != old_ptl) spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); flush_tlb_batched_pending(vma->vm_mm); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE, new_ptep += nr_ptes, new_addr += nr_ptes * 
PAGE_SIZE) { @@ -301,7 +302,7 @@ static int move_ptes(struct pagetable_move_control *pmc, } } - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); if (force_flush) flush_tlb_range(vma, old_end - len, old_end); if (new_ptl != old_ptl) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6dbcdceecae1..f901675dd060 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -95,6 +95,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift, pgtbl_mod_mask *mask) { + lazy_mmu_state_t lazy_mmu_state; pte_t *pte; u64 pfn; struct page *page; @@ -105,7 +106,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, if (!pte) return -ENOMEM; - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); do { if (unlikely(!pte_none(ptep_get(pte)))) { @@ -131,7 +132,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pfn++; } while (pte += PFN_DOWN(size), addr += size, addr != end); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); *mask |= PGTBL_PTE_MODIFIED; return 0; } @@ -354,12 +355,13 @@ int ioremap_page_range(unsigned long addr, unsigned long end, static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pgtbl_mod_mask *mask) { + lazy_mmu_state_t lazy_mmu_state; pte_t *pte; pte_t ptent; unsigned long size = PAGE_SIZE; pte = pte_offset_kernel(pmd, addr); - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); do { #ifdef CONFIG_HUGETLB_PAGE @@ -378,7 +380,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, WARN_ON(!pte_none(ptent) && !pte_present(ptent)); } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); *mask |= PGTBL_PTE_MODIFIED; } @@ -514,6 +516,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { + lazy_mmu_state_t lazy_mmu_state; int err = 0; pte_t *pte; @@ -526,7 +529,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, if (!pte) return -ENOMEM; - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); do { struct page *page = pages[*nr]; @@ -548,7 +551,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, (*nr)++; } while (pte++, addr += PAGE_SIZE, addr != end); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); *mask |= PGTBL_PTE_MODIFIED; return err; diff --git a/mm/vmscan.c b/mm/vmscan.c index a48aec8bfd92..13b6657c8743 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3521,6 +3521,7 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio, static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, struct mm_walk *args) { + lazy_mmu_state_t lazy_mmu_state; int i; bool dirty; pte_t *pte; @@ -3550,7 +3551,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, return false; } - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); restart: for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) { unsigned long pfn; @@ -3591,7 +3592,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end)) goto restart; - arch_leave_lazy_mmu_mode(); + 
arch_leave_lazy_mmu_mode(lazy_mmu_state); pte_unmap_unlock(pte, ptl); return suitable_to_scan(total, young); @@ -3600,6 +3601,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma, struct mm_walk *args, unsigned long *bitmap, unsigned long *first) { + lazy_mmu_state_t lazy_mmu_state; int i; bool dirty; pmd_t *pmd; @@ -3632,7 +3634,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area if (!spin_trylock(ptl)) goto done; - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); do { unsigned long pfn; @@ -3679,7 +3681,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area walk_update_folio(walk, last, gen, dirty); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); spin_unlock(ptl); done: *first = -1; @@ -4227,6 +4229,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) */ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) { + lazy_mmu_state_t lazy_mmu_state; int i; bool dirty; unsigned long start; @@ -4278,7 +4281,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) } } - arch_enter_lazy_mmu_mode(); + lazy_mmu_state = arch_enter_lazy_mmu_mode(); pte -= (addr - start) / PAGE_SIZE; @@ -4312,7 +4315,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) walk_update_folio(walk, last, gen, dirty); - arch_leave_lazy_mmu_mode(); + arch_leave_lazy_mmu_mode(lazy_mmu_state); /* feedback from rmap walkers to page table walkers */ if (mm_state && suitable_to_scan(i, young)) -- 2.47.0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-04 12:57 ` [PATCH 2/7] mm: introduce local state for lazy_mmu sections Kevin Brodsky @ 2025-09-04 15:06 ` Yeoreum Yun 2025-09-04 15:47 ` Kevin Brodsky 2025-09-04 17:28 ` Lorenzo Stoakes 2025-09-05 11:19 ` Mike Rapoport 2 siblings, 1 reply; 24+ messages in thread From: Yeoreum Yun @ 2025-09-04 15:06 UTC (permalink / raw) To: Kevin Brodsky Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel Hi Kevin, [...] > Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> > --- > arch/arm64/include/asm/pgtable.h | 10 +++++++--- > .../include/asm/book3s/64/tlbflush-hash.h | 9 ++++++--- > arch/powerpc/mm/book3s64/hash_tlb.c | 10 ++++++---- > arch/powerpc/mm/book3s64/subpage_prot.c | 5 +++-- > arch/sparc/include/asm/tlbflush_64.h | 5 +++-- > arch/sparc/mm/tlb.c | 6 ++++-- > arch/x86/include/asm/paravirt.h | 6 ++++-- > arch/x86/include/asm/paravirt_types.h | 2 ++ > arch/x86/xen/enlighten_pv.c | 2 +- > arch/x86/xen/mmu_pv.c | 2 +- > fs/proc/task_mmu.c | 5 +++-- > include/linux/mm_types.h | 3 +++ > include/linux/pgtable.h | 6 ++++-- > mm/madvise.c | 20 ++++++++++--------- > mm/memory.c | 20 +++++++++++-------- > mm/migrate_device.c | 5 +++-- > mm/mprotect.c | 5 +++-- > mm/mremap.c | 5 +++-- > mm/vmalloc.c | 15 ++++++++------ > mm/vmscan.c | 15 ++++++++------ > 20 files changed, 97 insertions(+), 59 deletions(-) I think you miss the mm/kasan/shadow.c But here, the usage is like: static int kasan_populate_vmalloc_pte() { ... arch_leave_lazy_mmu_mode(); ... arch_enter_lazy_mmu_mode(); ... } Might be you can call the arch_leave_lazy_mmu_mode() with LAZY_MMU_DEFAULT in here since I think kasan_populate_vmalloc_pte() wouldn't be called nestly. [...] Thanks. -- Sincerely, Yeoreum Yun ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-04 15:06 ` Yeoreum Yun @ 2025-09-04 15:47 ` Kevin Brodsky 0 siblings, 0 replies; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 15:47 UTC (permalink / raw) To: Yeoreum Yun Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On 04/09/2025 17:06, Yeoreum Yun wrote: > Hi Kevin, > > [...] >> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> >> --- >> arch/arm64/include/asm/pgtable.h | 10 +++++++--- >> .../include/asm/book3s/64/tlbflush-hash.h | 9 ++++++--- >> arch/powerpc/mm/book3s64/hash_tlb.c | 10 ++++++---- >> arch/powerpc/mm/book3s64/subpage_prot.c | 5 +++-- >> arch/sparc/include/asm/tlbflush_64.h | 5 +++-- >> arch/sparc/mm/tlb.c | 6 ++++-- >> arch/x86/include/asm/paravirt.h | 6 ++++-- >> arch/x86/include/asm/paravirt_types.h | 2 ++ >> arch/x86/xen/enlighten_pv.c | 2 +- >> arch/x86/xen/mmu_pv.c | 2 +- >> fs/proc/task_mmu.c | 5 +++-- >> include/linux/mm_types.h | 3 +++ >> include/linux/pgtable.h | 6 ++++-- >> mm/madvise.c | 20 ++++++++++--------- >> mm/memory.c | 20 +++++++++++-------- >> mm/migrate_device.c | 5 +++-- >> mm/mprotect.c | 5 +++-- >> mm/mremap.c | 5 +++-- >> mm/vmalloc.c | 15 ++++++++------ >> mm/vmscan.c | 15 ++++++++------ >> 20 files changed, 97 insertions(+), 59 deletions(-) > I think you miss the mm/kasan/shadow.c Ah yes that's because my series is based on v6.17-rc4 but [1] isn't in mainline yet. I'll rebase v2 on top of mm-stable. [1] https://lore.kernel.org/all/0d2efb7ddddbff6b288fbffeeb10166e90771718.1755528662.git.agordeev@linux.ibm.com/ > But here, the usage is like: > > static int kasan_populate_vmalloc_pte() > { > ... > arch_leave_lazy_mmu_mode(); > ... > arch_enter_lazy_mmu_mode(); > ... > } > > Might be you can call the arch_leave_lazy_mmu_mode() with LAZY_MMU_DEFAULT > in here since I think kasan_populate_vmalloc_pte() wouldn't be called > nestly. In fact in that case it doesn't matter if the section is nested or not. We're already assuming that lazy_mmu is enabled, and we want to fully disable it so that PTE operations take effect immediately. For that to happen we must call arch_leave_lazy_mmu_mode(LAZY_MMU_DEFAULT). We will then re-enable lazy_mmu, and the next call to leave() will do the right thing whether it is nested or not. It's worth nothing the same situation occurs in xen_flush_lazy_mmu() and this patch handles it in the way I've just described. I'll take care of that in v2, thanks for the heads-up! - Kevin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-04 12:57 ` [PATCH 2/7] mm: introduce local state for lazy_mmu sections Kevin Brodsky 2025-09-04 15:06 ` Yeoreum Yun @ 2025-09-04 17:28 ` Lorenzo Stoakes 2025-09-04 22:14 ` Kevin Brodsky 2025-09-05 11:19 ` Mike Rapoport 2 siblings, 1 reply; 24+ messages in thread From: Lorenzo Stoakes @ 2025-09-04 17:28 UTC (permalink / raw) To: Kevin Brodsky Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel Hi Kevin, This is causing a build failure: In file included from ./include/linux/mm.h:31, from mm/userfaultfd.c:8: mm/userfaultfd.c: In function ‘move_present_ptes’: ./include/linux/pgtable.h:247:41: error: statement with no effect [-Werror=unused-value] 247 | #define arch_enter_lazy_mmu_mode() (LAZY_MMU_DEFAULT) | ^ mm/userfaultfd.c:1103:9: note: in expansion of macro ‘arch_enter_lazy_mmu_mode’ 1103 | arch_enter_lazy_mmu_mode(); | ^~~~~~~~~~~~~~~~~~~~~~~~ ./include/linux/pgtable.h:248:54: error: expected expression before ‘)’ token 248 | #define arch_leave_lazy_mmu_mode(state) ((void)(state)) | ^ mm/userfaultfd.c:1141:9: note: in expansion of macro ‘arch_leave_lazy_mmu_mode’ 1141 | arch_leave_lazy_mmu_mode(); | ^~~~~~~~~~~~~~~~~~~~~~~~ It seems you haven't carefully checked call sites here, please do very carefully recheck these - I see Yeoreum reported a mising kasan case, so I suggest you just aggressively grep this + make sure you've covered all bases :) Cheers, Lorenzo On Thu, Sep 04, 2025 at 01:57:31PM +0100, Kevin Brodsky wrote: > arch_{enter,leave}_lazy_mmu_mode() currently have a stateless API > (taking and returning no value). This is proving problematic in > situations where leave() needs to restore some context back to its > original state (before enter() was called). In particular, this > makes it difficult to support the nesting of lazy_mmu sections - > leave() does not know whether the matching enter() call occurred > while lazy_mmu was already enabled, and whether to disable it or > not. > > This patch gives all architectures the chance to store local state > while inside a lazy_mmu section by making enter() return some value, > storing it in a local variable, and having leave() take that value. > That value is typed lazy_mmu_state_t - each architecture defining > __HAVE_ARCH_ENTER_LAZY_MMU_MODE is free to define it as it sees fit. > For now we define it as int everywhere, which is sufficient to > support nesting. > > The diff is unfortunately rather large as all the API changes need > to be done atomically. Main parts: > > * Changing the prototypes of arch_{enter,leave}_lazy_mmu_mode() > in generic and arch code, and introducing lazy_mmu_state_t. > > * Introducing LAZY_MMU_{DEFAULT,NESTED} for future support of > nesting. enter() always returns LAZY_MMU_DEFAULT for now. > (linux/mm_types.h is not the most natural location for defining > those constants, but there is no other obvious header that is > accessible where arch's implement the helpers.) 
> > * Changing all lazy_mmu sections to introduce a lazy_mmu_state > local variable, having enter() set it and leave() take it. Most of > these changes were generated using the Coccinelle script below. > > @@ > @@ > { > + lazy_mmu_state_t lazy_mmu_state; > ... > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > ... > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > ... > } > > Note: it is difficult to provide a default definition of > lazy_mmu_state_t for architectures implementing lazy_mmu, because > that definition would need to be available in > arch/x86/include/asm/paravirt_types.h and adding a new generic > #include there is very tricky due to the existing header soup. > > Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> > --- > arch/arm64/include/asm/pgtable.h | 10 +++++++--- > .../include/asm/book3s/64/tlbflush-hash.h | 9 ++++++--- > arch/powerpc/mm/book3s64/hash_tlb.c | 10 ++++++---- > arch/powerpc/mm/book3s64/subpage_prot.c | 5 +++-- > arch/sparc/include/asm/tlbflush_64.h | 5 +++-- > arch/sparc/mm/tlb.c | 6 ++++-- > arch/x86/include/asm/paravirt.h | 6 ++++-- > arch/x86/include/asm/paravirt_types.h | 2 ++ > arch/x86/xen/enlighten_pv.c | 2 +- > arch/x86/xen/mmu_pv.c | 2 +- > fs/proc/task_mmu.c | 5 +++-- > include/linux/mm_types.h | 3 +++ > include/linux/pgtable.h | 6 ++++-- > mm/madvise.c | 20 ++++++++++--------- > mm/memory.c | 20 +++++++++++-------- > mm/migrate_device.c | 5 +++-- > mm/mprotect.c | 5 +++-- > mm/mremap.c | 5 +++-- > mm/vmalloc.c | 15 ++++++++------ > mm/vmscan.c | 15 ++++++++------ > 20 files changed, 97 insertions(+), 59 deletions(-) > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h > index 728d7b6ed20a..816197d08165 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -81,7 +81,9 @@ static inline void queue_pte_barriers(void) > } > > #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE > -static inline void arch_enter_lazy_mmu_mode(void) > +typedef int lazy_mmu_state_t; > + > +static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) > { > /* > * lazy_mmu_mode is not supposed to permit nesting. But in practice this > @@ -96,12 +98,14 @@ static inline void arch_enter_lazy_mmu_mode(void) > */ > > if (in_interrupt()) > - return; > + return LAZY_MMU_DEFAULT; > > set_thread_flag(TIF_LAZY_MMU); > + > + return LAZY_MMU_DEFAULT; > } > > -static inline void arch_leave_lazy_mmu_mode(void) > +static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) > { > if (in_interrupt()) > return; > diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > index 176d7fd79eeb..c9f1e819e567 100644 > --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h > @@ -25,13 +25,14 @@ DECLARE_PER_CPU(struct ppc64_tlb_batch, ppc64_tlb_batch); > extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch); > > #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE > +typedef int lazy_mmu_state_t; > > -static inline void arch_enter_lazy_mmu_mode(void) > +static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) > { > struct ppc64_tlb_batch *batch; > > if (radix_enabled()) > - return; > + return LAZY_MMU_DEFAULT; > /* > * apply_to_page_range can call us this preempt enabled when > * operating on kernel page tables. 
> @@ -39,9 +40,11 @@ static inline void arch_enter_lazy_mmu_mode(void) > preempt_disable(); > batch = this_cpu_ptr(&ppc64_tlb_batch); > batch->active = 1; > + > + return LAZY_MMU_DEFAULT; > } > > -static inline void arch_leave_lazy_mmu_mode(void) > +static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) > { > struct ppc64_tlb_batch *batch; > > diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c > index 21fcad97ae80..ee664f88e679 100644 > --- a/arch/powerpc/mm/book3s64/hash_tlb.c > +++ b/arch/powerpc/mm/book3s64/hash_tlb.c > @@ -189,6 +189,7 @@ void hash__tlb_flush(struct mmu_gather *tlb) > */ > void __flush_hash_table_range(unsigned long start, unsigned long end) > { > + lazy_mmu_state_t lazy_mmu_state; > int hugepage_shift; > unsigned long flags; > > @@ -205,7 +206,7 @@ void __flush_hash_table_range(unsigned long start, unsigned long end) > * way to do things but is fine for our needs here. > */ > local_irq_save(flags); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > for (; start < end; start += PAGE_SIZE) { > pte_t *ptep = find_init_mm_pte(start, &hugepage_shift); > unsigned long pte; > @@ -217,12 +218,13 @@ void __flush_hash_table_range(unsigned long start, unsigned long end) > continue; > hpte_need_flush(&init_mm, start, ptep, pte, hugepage_shift); > } > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > local_irq_restore(flags); > } > > void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) > { > + lazy_mmu_state_t lazy_mmu_state; > pte_t *pte; > pte_t *start_pte; > unsigned long flags; > @@ -237,7 +239,7 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long > * way to do things but is fine for our needs here. 
> */ > local_irq_save(flags); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > start_pte = pte_offset_map(pmd, addr); > if (!start_pte) > goto out; > @@ -249,6 +251,6 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long > } > pte_unmap(start_pte); > out: > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > local_irq_restore(flags); > } > diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c > index ec98e526167e..4720f9f321af 100644 > --- a/arch/powerpc/mm/book3s64/subpage_prot.c > +++ b/arch/powerpc/mm/book3s64/subpage_prot.c > @@ -53,6 +53,7 @@ void subpage_prot_free(struct mm_struct *mm) > static void hpte_flush_range(struct mm_struct *mm, unsigned long addr, > int npages) > { > + lazy_mmu_state_t lazy_mmu_state; > pgd_t *pgd; > p4d_t *p4d; > pud_t *pud; > @@ -73,13 +74,13 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr, > pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > if (!pte) > return; > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > for (; npages > 0; --npages) { > pte_update(mm, addr, pte, 0, 0, 0); > addr += PAGE_SIZE; > ++pte; > } > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(pte - 1, ptl); > } > > diff --git a/arch/sparc/include/asm/tlbflush_64.h b/arch/sparc/include/asm/tlbflush_64.h > index cd144eb31bdd..02c93a4e6af5 100644 > --- a/arch/sparc/include/asm/tlbflush_64.h > +++ b/arch/sparc/include/asm/tlbflush_64.h > @@ -40,10 +40,11 @@ static inline void flush_tlb_range(struct vm_area_struct *vma, > void flush_tlb_kernel_range(unsigned long start, unsigned long end); > > #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE > +typedef int lazy_mmu_state_t; > > void flush_tlb_pending(void); > -void arch_enter_lazy_mmu_mode(void); > -void arch_leave_lazy_mmu_mode(void); > +lazy_mmu_state_t arch_enter_lazy_mmu_mode(void); > +void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state); > > /* Local cpu only. 
*/ > void __flush_tlb_all(void); > diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c > index a35ddcca5e76..bf5094b770af 100644 > --- a/arch/sparc/mm/tlb.c > +++ b/arch/sparc/mm/tlb.c > @@ -50,16 +50,18 @@ void flush_tlb_pending(void) > put_cpu_var(tlb_batch); > } > > -void arch_enter_lazy_mmu_mode(void) > +lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) > { > struct tlb_batch *tb; > > preempt_disable(); > tb = this_cpu_ptr(&tlb_batch); > tb->active = 1; > + > + return LAZY_MMU_DEFAULT; > } > > -void arch_leave_lazy_mmu_mode(void) > +void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) > { > struct tlb_batch *tb = this_cpu_ptr(&tlb_batch); > > diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h > index b5e59a7ba0d0..65a0d394fba1 100644 > --- a/arch/x86/include/asm/paravirt.h > +++ b/arch/x86/include/asm/paravirt.h > @@ -527,12 +527,14 @@ static inline void arch_end_context_switch(struct task_struct *next) > } > > #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE > -static inline void arch_enter_lazy_mmu_mode(void) > +static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) > { > PVOP_VCALL0(mmu.lazy_mode.enter); > + > + return LAZY_MMU_DEFAULT; > } > > -static inline void arch_leave_lazy_mmu_mode(void) > +static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) > { > PVOP_VCALL0(mmu.lazy_mode.leave); > } > diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h > index 37a8627d8277..bc1af86868a3 100644 > --- a/arch/x86/include/asm/paravirt_types.h > +++ b/arch/x86/include/asm/paravirt_types.h > @@ -41,6 +41,8 @@ struct pv_info { > }; > > #ifdef CONFIG_PARAVIRT_XXL > +typedef int lazy_mmu_state_t; > + > struct pv_lazy_ops { > /* Set deferred update mode, used for batching operations. 
*/ > void (*enter)(void); > diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c > index 26bbaf4b7330..a245ba47a631 100644 > --- a/arch/x86/xen/enlighten_pv.c > +++ b/arch/x86/xen/enlighten_pv.c > @@ -426,7 +426,7 @@ static void xen_start_context_switch(struct task_struct *prev) > BUG_ON(preemptible()); > > if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) { > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(LAZY_MMU_DEFAULT); > set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU_UPDATES); > } > enter_lazy(XEN_LAZY_CPU); > diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c > index 2a4a8deaf612..2039d5132ca3 100644 > --- a/arch/x86/xen/mmu_pv.c > +++ b/arch/x86/xen/mmu_pv.c > @@ -2140,7 +2140,7 @@ static void xen_flush_lazy_mmu(void) > preempt_disable(); > > if (xen_get_lazy_mode() == XEN_LAZY_MMU) { > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(LAZY_MMU_DEFAULT); > arch_enter_lazy_mmu_mode(); > } > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 29cca0e6d0ff..c9bf1128a4cd 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -2610,6 +2610,7 @@ static int pagemap_scan_thp_entry(pmd_t *pmd, unsigned long start, > static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, > unsigned long end, struct mm_walk *walk) > { > + lazy_mmu_state_t lazy_mmu_state; > struct pagemap_scan_private *p = walk->private; > struct vm_area_struct *vma = walk->vma; > unsigned long addr, flush_end = 0; > @@ -2628,7 +2629,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, > return 0; > } > > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) { > /* Fast path for performing exclusive WP */ > @@ -2698,7 +2699,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, > if (flush_end) > flush_tlb_range(vma, start, addr); > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(start_pte, ptl); > > cond_resched(); > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 08bc2442db93..18745c32f2c0 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1441,6 +1441,9 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm); > extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm); > extern void tlb_finish_mmu(struct mmu_gather *tlb); > > +#define LAZY_MMU_DEFAULT 0 > +#define LAZY_MMU_NESTED 1 > + > struct vm_fault; > > /** > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 8848e132a6be..6932c8e344ab 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -232,8 +232,10 @@ static inline int pmd_dirty(pmd_t pmd) > * and the mode cannot be used in interrupt context. 
> */ > #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE > -#define arch_enter_lazy_mmu_mode() do {} while (0) > -#define arch_leave_lazy_mmu_mode() do {} while (0) > +typedef int lazy_mmu_state_t; > + > +#define arch_enter_lazy_mmu_mode() (LAZY_MMU_DEFAULT) > +#define arch_leave_lazy_mmu_mode(state) ((void)(state)) > #endif > > #ifndef pte_batch_hint > diff --git a/mm/madvise.c b/mm/madvise.c > index 35ed4ab0d7c5..72c032f2cf56 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -357,6 +357,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > unsigned long addr, unsigned long end, > struct mm_walk *walk) > { > + lazy_mmu_state_t lazy_mmu_state; > struct madvise_walk_private *private = walk->private; > struct mmu_gather *tlb = private->tlb; > bool pageout = private->pageout; > @@ -455,7 +456,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > if (!start_pte) > return 0; > flush_tlb_batched_pending(mm); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) { > nr = 1; > ptent = ptep_get(pte); > @@ -463,7 +464,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > if (++batch_count == SWAP_CLUSTER_MAX) { > batch_count = 0; > if (need_resched()) { > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(start_pte, ptl); > cond_resched(); > goto restart; > @@ -499,7 +500,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > if (!folio_trylock(folio)) > continue; > folio_get(folio); > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(start_pte, ptl); > start_pte = NULL; > err = split_folio(folio); > @@ -510,7 +511,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > if (!start_pte) > break; > flush_tlb_batched_pending(mm); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > if (!err) > nr = 0; > continue; > @@ -558,7 +559,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > } > > if (start_pte) { > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(start_pte, ptl); > } > if (pageout) > @@ -657,6 +658,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > > { > const cydp_t cydp_flags = CYDP_CLEAR_YOUNG | CYDP_CLEAR_DIRTY; > + lazy_mmu_state_t lazy_mmu_state; > struct mmu_gather *tlb = walk->private; > struct mm_struct *mm = tlb->mm; > struct vm_area_struct *vma = walk->vma; > @@ -677,7 +679,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > if (!start_pte) > return 0; > flush_tlb_batched_pending(mm); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) { > nr = 1; > ptent = ptep_get(pte); > @@ -727,7 +729,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > if (!folio_trylock(folio)) > continue; > folio_get(folio); > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(start_pte, ptl); > start_pte = NULL; > err = split_folio(folio); > @@ -738,7 +740,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > if (!start_pte) > break; > flush_tlb_batched_pending(mm); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > if (!err) > nr = 0; > continue; > @@ -778,7 +780,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > if (nr_swap) > add_mm_counter(mm, 
MM_SWAPENTS, nr_swap); > if (start_pte) { > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(start_pte, ptl); > } > cond_resched(); > diff --git a/mm/memory.c b/mm/memory.c > index 0ba4f6b71847..ebe0ffddcb77 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1079,6 +1079,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, > unsigned long end) > { > + lazy_mmu_state_t lazy_mmu_state; > struct mm_struct *dst_mm = dst_vma->vm_mm; > struct mm_struct *src_mm = src_vma->vm_mm; > pte_t *orig_src_pte, *orig_dst_pte; > @@ -1126,7 +1127,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); > orig_src_pte = src_pte; > orig_dst_pte = dst_pte; > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > do { > nr = 1; > @@ -1195,7 +1196,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, > } while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr, > addr != end); > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(orig_src_pte, src_ptl); > add_mm_rss_vec(dst_mm, rss); > pte_unmap_unlock(orig_dst_pte, dst_ptl); > @@ -1694,6 +1695,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > unsigned long addr, unsigned long end, > struct zap_details *details) > { > + lazy_mmu_state_t lazy_mmu_state; > bool force_flush = false, force_break = false; > struct mm_struct *mm = tlb->mm; > int rss[NR_MM_COUNTERS]; > @@ -1714,7 +1716,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > return addr; > > flush_tlb_batched_pending(mm); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > do { > bool any_skipped = false; > > @@ -1746,7 +1748,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, > direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval); > > add_mm_rss_vec(mm, rss); > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > > /* Do the actual TLB flush before dropping ptl */ > if (force_flush) { > @@ -2683,6 +2685,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, > unsigned long addr, unsigned long end, > unsigned long pfn, pgprot_t prot) > { > + lazy_mmu_state_t lazy_mmu_state; > pte_t *pte, *mapped_pte; > spinlock_t *ptl; > int err = 0; > @@ -2690,7 +2693,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, > mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl); > if (!pte) > return -ENOMEM; > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > do { > BUG_ON(!pte_none(ptep_get(pte))); > if (!pfn_modify_allowed(pfn, prot)) { > @@ -2700,7 +2703,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, > set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot))); > pfn++; > } while (pte++, addr += PAGE_SIZE, addr != end); > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(mapped_pte, ptl); > return err; > } > @@ -2989,6 +2992,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, > pte_fn_t fn, void *data, bool create, > pgtbl_mod_mask *mask) > { > + lazy_mmu_state_t lazy_mmu_state; > pte_t *pte, *mapped_pte; > int err = 0; > spinlock_t *ptl; > @@ -3007,7 +3011,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, > return -EINVAL; > } > > - 
arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > if (fn) { > do { > @@ -3020,7 +3024,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, > } > *mask |= PGTBL_PTE_MODIFIED; > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > > if (mm != &init_mm) > pte_unmap_unlock(mapped_pte, ptl); > diff --git a/mm/migrate_device.c b/mm/migrate_device.c > index e05e14d6eacd..659285c6ba77 100644 > --- a/mm/migrate_device.c > +++ b/mm/migrate_device.c > @@ -59,6 +59,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > unsigned long end, > struct mm_walk *walk) > { > + lazy_mmu_state_t lazy_mmu_state; > struct migrate_vma *migrate = walk->private; > struct folio *fault_folio = migrate->fault_page ? > page_folio(migrate->fault_page) : NULL; > @@ -110,7 +111,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl); > if (!ptep) > goto again; > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > for (; addr < end; addr += PAGE_SIZE, ptep++) { > struct dev_pagemap *pgmap; > @@ -287,7 +288,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, > if (unmapped) > flush_tlb_range(walk->vma, start, end); > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(ptep - 1, ptl); > > return 0; > diff --git a/mm/mprotect.c b/mm/mprotect.c > index 113b48985834..7bba651e5aa3 100644 > --- a/mm/mprotect.c > +++ b/mm/mprotect.c > @@ -273,6 +273,7 @@ static long change_pte_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, > unsigned long end, pgprot_t newprot, unsigned long cp_flags) > { > + lazy_mmu_state_t lazy_mmu_state; > pte_t *pte, oldpte; > spinlock_t *ptl; > long pages = 0; > @@ -293,7 +294,7 @@ static long change_pte_range(struct mmu_gather *tlb, > target_node = numa_node_id(); > > flush_tlb_batched_pending(vma->vm_mm); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > do { > nr_ptes = 1; > oldpte = ptep_get(pte); > @@ -439,7 +440,7 @@ static long change_pte_range(struct mmu_gather *tlb, > } > } > } while (pte += nr_ptes, addr += nr_ptes * PAGE_SIZE, addr != end); > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(pte - 1, ptl); > > return pages; > diff --git a/mm/mremap.c b/mm/mremap.c > index e618a706aff5..dac29a734e16 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -193,6 +193,7 @@ static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr > static int move_ptes(struct pagetable_move_control *pmc, > unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd) > { > + lazy_mmu_state_t lazy_mmu_state; > struct vm_area_struct *vma = pmc->old; > bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma); > struct mm_struct *mm = vma->vm_mm; > @@ -256,7 +257,7 @@ static int move_ptes(struct pagetable_move_control *pmc, > if (new_ptl != old_ptl) > spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); > flush_tlb_batched_pending(vma->vm_mm); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE, > new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) { > @@ -301,7 +302,7 @@ static int move_ptes(struct pagetable_move_control *pmc, > } > } > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > if (force_flush) > flush_tlb_range(vma, old_end - len, old_end); 
> if (new_ptl != old_ptl) > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 6dbcdceecae1..f901675dd060 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -95,6 +95,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > phys_addr_t phys_addr, pgprot_t prot, > unsigned int max_page_shift, pgtbl_mod_mask *mask) > { > + lazy_mmu_state_t lazy_mmu_state; > pte_t *pte; > u64 pfn; > struct page *page; > @@ -105,7 +106,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > if (!pte) > return -ENOMEM; > > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > do { > if (unlikely(!pte_none(ptep_get(pte)))) { > @@ -131,7 +132,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > pfn++; > } while (pte += PFN_DOWN(size), addr += size, addr != end); > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > *mask |= PGTBL_PTE_MODIFIED; > return 0; > } > @@ -354,12 +355,13 @@ int ioremap_page_range(unsigned long addr, unsigned long end, > static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > pgtbl_mod_mask *mask) > { > + lazy_mmu_state_t lazy_mmu_state; > pte_t *pte; > pte_t ptent; > unsigned long size = PAGE_SIZE; > > pte = pte_offset_kernel(pmd, addr); > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > do { > #ifdef CONFIG_HUGETLB_PAGE > @@ -378,7 +380,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, > WARN_ON(!pte_none(ptent) && !pte_present(ptent)); > } while (pte += (size >> PAGE_SHIFT), addr += size, addr != end); > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > *mask |= PGTBL_PTE_MODIFIED; > } > > @@ -514,6 +516,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > unsigned long end, pgprot_t prot, struct page **pages, int *nr, > pgtbl_mod_mask *mask) > { > + lazy_mmu_state_t lazy_mmu_state; > int err = 0; > pte_t *pte; > > @@ -526,7 +529,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > if (!pte) > return -ENOMEM; > > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > do { > struct page *page = pages[*nr]; > @@ -548,7 +551,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, > (*nr)++; > } while (pte++, addr += PAGE_SIZE, addr != end); > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > *mask |= PGTBL_PTE_MODIFIED; > > return err; > diff --git a/mm/vmscan.c b/mm/vmscan.c > index a48aec8bfd92..13b6657c8743 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -3521,6 +3521,7 @@ static void walk_update_folio(struct lru_gen_mm_walk *walk, struct folio *folio, > static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > struct mm_walk *args) > { > + lazy_mmu_state_t lazy_mmu_state; > int i; > bool dirty; > pte_t *pte; > @@ -3550,7 +3551,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > return false; > } > > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > restart: > for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) { > unsigned long pfn; > @@ -3591,7 +3592,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end)) > goto restart; > > - arch_leave_lazy_mmu_mode(); > + 
arch_leave_lazy_mmu_mode(lazy_mmu_state); > pte_unmap_unlock(pte, ptl); > > return suitable_to_scan(total, young); > @@ -3600,6 +3601,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end, > static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area_struct *vma, > struct mm_walk *args, unsigned long *bitmap, unsigned long *first) > { > + lazy_mmu_state_t lazy_mmu_state; > int i; > bool dirty; > pmd_t *pmd; > @@ -3632,7 +3634,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area > if (!spin_trylock(ptl)) > goto done; > > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > do { > unsigned long pfn; > @@ -3679,7 +3681,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area > > walk_update_folio(walk, last, gen, dirty); > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > spin_unlock(ptl); > done: > *first = -1; > @@ -4227,6 +4229,7 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc) > */ > bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) > { > + lazy_mmu_state_t lazy_mmu_state; > int i; > bool dirty; > unsigned long start; > @@ -4278,7 +4281,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) > } > } > > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > > pte -= (addr - start) / PAGE_SIZE; > > @@ -4312,7 +4315,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw) > > walk_update_folio(walk, last, gen, dirty); > > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > > /* feedback from rmap walkers to page table walkers */ > if (mm_state && suitable_to_scan(i, young)) > -- > 2.47.0 > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-04 17:28 ` Lorenzo Stoakes @ 2025-09-04 22:14 ` Kevin Brodsky 2025-09-05 11:21 ` Lorenzo Stoakes 0 siblings, 1 reply; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 22:14 UTC (permalink / raw) To: Lorenzo Stoakes Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On 04/09/2025 19:28, Lorenzo Stoakes wrote: > Hi Kevin, > > This is causing a build failure: > > In file included from ./include/linux/mm.h:31, > from mm/userfaultfd.c:8: > mm/userfaultfd.c: In function ‘move_present_ptes’: > ./include/linux/pgtable.h:247:41: error: statement with no effect [-Werror=unused-value] > 247 | #define arch_enter_lazy_mmu_mode() (LAZY_MMU_DEFAULT) > | ^ > mm/userfaultfd.c:1103:9: note: in expansion of macro ‘arch_enter_lazy_mmu_mode’ > 1103 | arch_enter_lazy_mmu_mode(); > | ^~~~~~~~~~~~~~~~~~~~~~~~ > ./include/linux/pgtable.h:248:54: error: expected expression before ‘)’ token > 248 | #define arch_leave_lazy_mmu_mode(state) ((void)(state)) > | ^ > mm/userfaultfd.c:1141:9: note: in expansion of macro ‘arch_leave_lazy_mmu_mode’ > 1141 | arch_leave_lazy_mmu_mode(); > | ^~~~~~~~~~~~~~~~~~~~~~~~ > > It seems you haven't carefully checked call sites here, please do very > carefully recheck these - I see Yeoreum reported a mising kasan case, so I > suggest you just aggressively grep this + make sure you've covered all > bases :) I did check all call sites pretty carefully and of course build-tested, but my series is based on v6.17-rc4 - just like the calls Yeoreum mentioned, the issue is that those calls are in mm-stable but not in mainline :/ I suppose I should post a v2 rebased on mm-stable ASAP then? - Kevin ^ permalink raw reply [flat|nested] 24+ messages in thread
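The build failure quoted above is what an unconverted call site produces against the new generic fallbacks from patch 2/7: on a configuration without __HAVE_ARCH_ENTER_LAZY_MMU_MODE, the old stateless calls no longer compile. A minimal sketch of the two calling conventions (surrounding code omitted; only the lazy_mmu calls matter here):

	/*
	 * Old, stateless convention - no longer builds:
	 * arch_enter_lazy_mmu_mode() now expands to (LAZY_MMU_DEFAULT), which
	 * trips -Werror=unused-value, and arch_leave_lazy_mmu_mode() now
	 * expects a state argument.
	 */
	arch_enter_lazy_mmu_mode();
	/* ... batched PTE updates ... */
	arch_leave_lazy_mmu_mode();

	/* New convention, as produced by the Coccinelle rule in patch 2/7 */
	lazy_mmu_state_t lazy_mmu_state;

	lazy_mmu_state = arch_enter_lazy_mmu_mode();
	/* ... batched PTE updates ... */
	arch_leave_lazy_mmu_mode(lazy_mmu_state);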
* Re: [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-04 22:14 ` Kevin Brodsky @ 2025-09-05 11:21 ` Lorenzo Stoakes 2025-09-05 11:37 ` Lorenzo Stoakes 0 siblings, 1 reply; 24+ messages in thread From: Lorenzo Stoakes @ 2025-09-05 11:21 UTC (permalink / raw) To: Kevin Brodsky Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On Fri, Sep 05, 2025 at 12:14:39AM +0200, Kevin Brodsky wrote: > On 04/09/2025 19:28, Lorenzo Stoakes wrote: > > Hi Kevin, > > > > This is causing a build failure: > > > > In file included from ./include/linux/mm.h:31, > > from mm/userfaultfd.c:8: > > mm/userfaultfd.c: In function ‘move_present_ptes’: > > ./include/linux/pgtable.h:247:41: error: statement with no effect [-Werror=unused-value] > > 247 | #define arch_enter_lazy_mmu_mode() (LAZY_MMU_DEFAULT) > > | ^ > > mm/userfaultfd.c:1103:9: note: in expansion of macro ‘arch_enter_lazy_mmu_mode’ > > 1103 | arch_enter_lazy_mmu_mode(); > > | ^~~~~~~~~~~~~~~~~~~~~~~~ > > ./include/linux/pgtable.h:248:54: error: expected expression before ‘)’ token > > 248 | #define arch_leave_lazy_mmu_mode(state) ((void)(state)) > > | ^ > > mm/userfaultfd.c:1141:9: note: in expansion of macro ‘arch_leave_lazy_mmu_mode’ > > 1141 | arch_leave_lazy_mmu_mode(); > > | ^~~~~~~~~~~~~~~~~~~~~~~~ > > > > It seems you haven't carefully checked call sites here, please do very > > carefully recheck these - I see Yeoreum reported a mising kasan case, so I > > suggest you just aggressively grep this + make sure you've covered all > > bases :) > > I did check all call sites pretty carefully and of course build-tested, > but my series is based on v6.17-rc4 - just like the calls Yeoreum > mentioned, the issue is that those calls are in mm-stable but not in > mainline :/ I suppose I should post a v2 rebased on mm-stable ASAP then? You should really base on mm-new. You need to account for everything that is potentially going to go upstream. mm-stable is generally not actually populated all too well until shortly before merge window anyway. > > - Kevin Thanks, Lorenzo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-05 11:21 ` Lorenzo Stoakes @ 2025-09-05 11:37 ` Lorenzo Stoakes 2025-09-05 12:22 ` Kevin Brodsky 0 siblings, 1 reply; 24+ messages in thread From: Lorenzo Stoakes @ 2025-09-05 11:37 UTC (permalink / raw) To: Kevin Brodsky Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On Fri, Sep 05, 2025 at 12:21:40PM +0100, Lorenzo Stoakes wrote: > You should really base on mm-new. > > You need to account for everything that is potentially going to go > upstream. mm-stable is generally not actually populated all too well until > shortly before merge window anyway. Just to note that mm-unstable is also fine. Despite its name, it's substantially more stable than mm-new, which can even break the build and appears to have no checks performed on it at all. But mm-new is the most up-to-date version of all the mm code. Thanks, Lorenzo ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-05 11:37 ` Lorenzo Stoakes @ 2025-09-05 12:22 ` Kevin Brodsky 0 siblings, 0 replies; 24+ messages in thread From: Kevin Brodsky @ 2025-09-05 12:22 UTC (permalink / raw) To: Lorenzo Stoakes Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On 05/09/2025 13:37, Lorenzo Stoakes wrote: > On Fri, Sep 05, 2025 at 12:21:40PM +0100, Lorenzo Stoakes wrote: >> You should really base on mm-new. >> >> You need to account for everything that is potentially going to go >> upstream. mm-stable is generally not actually populated all too well until >> shortly before merge window anyway. > Just to note that mm-unstable is also fine. Despite its name, it's substantially > more stable than mm-new, which can even break the build and appears to have no > checks performed on it at all. Thanks for the overview - I had a general idea about those branches but I wasn't sure what the standard practice was. I'll rebase on mm-unstable to start with. - Kevin ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/7] mm: introduce local state for lazy_mmu sections 2025-09-04 12:57 ` [PATCH 2/7] mm: introduce local state for lazy_mmu sections Kevin Brodsky 2025-09-04 15:06 ` Yeoreum Yun 2025-09-04 17:28 ` Lorenzo Stoakes @ 2025-09-05 11:19 ` Mike Rapoport 2 siblings, 0 replies; 24+ messages in thread From: Mike Rapoport @ 2025-09-05 11:19 UTC (permalink / raw) To: Kevin Brodsky Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On Thu, Sep 04, 2025 at 01:57:31PM +0100, Kevin Brodsky wrote: > arch_{enter,leave}_lazy_mmu_mode() currently have a stateless API > (taking and returning no value). This is proving problematic in > situations where leave() needs to restore some context back to its > original state (before enter() was called). In particular, this > makes it difficult to support the nesting of lazy_mmu sections - > leave() does not know whether the matching enter() call occurred > while lazy_mmu was already enabled, and whether to disable it or > not. > > This patch gives all architectures the chance to store local state > while inside a lazy_mmu section by making enter() return some value, > storing it in a local variable, and having leave() take that value. > That value is typed lazy_mmu_state_t - each architecture defining > __HAVE_ARCH_ENTER_LAZY_MMU_MODE is free to define it as it sees fit. > For now we define it as int everywhere, which is sufficient to > support nesting. > > The diff is unfortunately rather large as all the API changes need > to be done atomically. Main parts: > > * Changing the prototypes of arch_{enter,leave}_lazy_mmu_mode() > in generic and arch code, and introducing lazy_mmu_state_t. > > * Introducing LAZY_MMU_{DEFAULT,NESTED} for future support of > nesting. enter() always returns LAZY_MMU_DEFAULT for now. > (linux/mm_types.h is not the most natural location for defining > those constants, but there is no other obvious header that is > accessible where arch's implement the helpers.) > > * Changing all lazy_mmu sections to introduce a lazy_mmu_state > local variable, having enter() set it and leave() take it. Most of > these changes were generated using the Coccinelle script below. > > @@ > @@ > { > + lazy_mmu_state_t lazy_mmu_state; > ... > - arch_enter_lazy_mmu_mode(); > + lazy_mmu_state = arch_enter_lazy_mmu_mode(); > ... > - arch_leave_lazy_mmu_mode(); > + arch_leave_lazy_mmu_mode(lazy_mmu_state); > ... > } > > Note: it is difficult to provide a default definition of > lazy_mmu_state_t for architectures implementing lazy_mmu, because > that definition would need to be available in > arch/x86/include/asm/paravirt_types.h and adding a new generic > #include there is very tricky due to the existing header soup. > > Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 3/7] arm64: mm: fully support nested lazy_mmu sections 2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky 2025-09-04 12:57 ` [PATCH 1/7] mm: remove arch_flush_lazy_mmu_mode() Kevin Brodsky 2025-09-04 12:57 ` [PATCH 2/7] mm: introduce local state for lazy_mmu sections Kevin Brodsky @ 2025-09-04 12:57 ` Kevin Brodsky 2025-09-04 12:57 ` [PATCH 4/7] x86/xen: support nested lazy_mmu sections (again) Kevin Brodsky ` (4 subsequent siblings) 7 siblings, 0 replies; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 12:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel Despite recent efforts to prevent lazy_mmu sections from nesting, it remains difficult to ensure that it never occurs - and in fact it does occur on arm64 in certain situations (CONFIG_DEBUG_PAGEALLOC). Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested") made nesting tolerable on arm64, but without truly supporting it: the inner leave() call clears TIF_LAZY_MMU, disabling the batching optimisation before the outer section ends. Now that the lazy_mmu API allows enter() to pass through a state to the matching leave() call, we can actually support nesting. If enter() is called inside an active lazy_mmu section, TIF_LAZY_MMU will already be set, and we can then return LAZY_MMU_NESTED to instruct the matching leave() call not to clear TIF_LAZY_MMU. The only effect of this patch is to ensure that TIF_LAZY_MMU (and therefore the batching optimisation) remains set until the outermost lazy_mmu section ends. leave() still emits barriers if needed, regardless of the nesting level, as the caller may expect any page table changes to become visible when leave() returns. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> --- arch/arm64/include/asm/pgtable.h | 19 +++++-------------- 1 file changed, 5 insertions(+), 14 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 816197d08165..602feda97dc4 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -85,24 +85,14 @@ typedef int lazy_mmu_state_t; static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) { - /* - * lazy_mmu_mode is not supposed to permit nesting. But in practice this - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change - * permissions on the linear map with apply_to_page_range(), which - * re-enters lazy_mmu_mode. So we tolerate nesting in our - * implementation. The first call to arch_leave_lazy_mmu_mode() will - * flush and clear the flag such that the remainder of the work in the - * outer nest behaves as if outside of lazy mmu mode. This is safe and - * keeps tracking simple. - */ + int lazy_mmu_nested; if (in_interrupt()) return LAZY_MMU_DEFAULT; - set_thread_flag(TIF_LAZY_MMU); + lazy_mmu_nested = test_and_set_thread_flag(TIF_LAZY_MMU); - return LAZY_MMU_DEFAULT; + return lazy_mmu_nested ? 
LAZY_MMU_NESTED : LAZY_MMU_DEFAULT; } static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) @@ -113,7 +103,8 @@ static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) if (test_and_clear_thread_flag(TIF_LAZY_MMU_PENDING)) emit_pte_barriers(); - clear_thread_flag(TIF_LAZY_MMU); + if (state != LAZY_MMU_NESTED) + clear_thread_flag(TIF_LAZY_MMU); } #ifdef CONFIG_TRANSPARENT_HUGEPAGE -- 2.47.0 ^ permalink raw reply [flat|nested] 24+ messages in thread
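To illustrate the DEBUG_PAGEALLOC case that the removed comment described, here is a sketch of the nested call sequence under the implementation above (the callers named in the comments are only indicative of the scenario, not an exact call chain):

	lazy_mmu_state_t outer, inner;

	/* e.g. zap_pte_range(): outermost lazy_mmu section */
	outer = arch_enter_lazy_mmu_mode();	/* sets TIF_LAZY_MMU, returns LAZY_MMU_DEFAULT */

	/*
	 * A page allocation changes linear map permissions via
	 * apply_to_page_range(), which re-enters lazy_mmu:
	 */
	inner = arch_enter_lazy_mmu_mode();	/* flag already set, returns LAZY_MMU_NESTED */
	arch_leave_lazy_mmu_mode(inner);	/* emits any pending barriers, TIF_LAZY_MMU stays set */

	/* the remainder of the outer section still runs in lazy MMU mode */
	arch_leave_lazy_mmu_mode(outer);	/* emits any pending barriers, clears TIF_LAZY_MMU */

The difference from the previous behaviour is that the inner leave() no longer clears TIF_LAZY_MMU, so the rest of the outer section keeps the batching optimisation.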
* [PATCH 4/7] x86/xen: support nested lazy_mmu sections (again) 2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky ` (2 preceding siblings ...) 2025-09-04 12:57 ` [PATCH 3/7] arm64: mm: fully support nested " Kevin Brodsky @ 2025-09-04 12:57 ` Kevin Brodsky 2025-09-05 15:48 ` Alexander Gordeev 2025-09-04 12:57 ` [PATCH 5/7] powerpc/mm: support nested lazy_mmu sections Kevin Brodsky ` (3 subsequent siblings) 7 siblings, 1 reply; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 12:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel Commit 49147beb0ccb ("x86/xen: allow nesting of same lazy mode") originally introduced support for nested lazy sections (LAZY_MMU and LAZY_CPU). It later got reverted by commit c36549ff8d84 as its implementation turned out to be intolerant to preemption. Now that the lazy_mmu API allows enter() to pass through a state to the matching leave() call, we can support nesting again for the LAZY_MMU mode in a preemption-safe manner. If xen_enter_lazy_mmu() is called inside an active lazy_mmu section, xen_lazy_mode will already be set to XEN_LAZY_MMU and we can then return LAZY_MMU_NESTED to instruct the matching xen_leave_lazy_mmu() call to leave xen_lazy_mode unchanged. The only effect of this patch is to ensure that xen_lazy_mode remains set to XEN_LAZY_MMU until the outermost lazy_mmu section ends. xen_leave_lazy_mmu() still calls xen_mc_flush() unconditionally. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> --- arch/x86/include/asm/paravirt.h | 6 ++---- arch/x86/include/asm/paravirt_types.h | 4 ++-- arch/x86/xen/mmu_pv.c | 11 ++++++++--- 3 files changed, 12 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 65a0d394fba1..4ecd3a6b1dea 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -529,14 +529,12 @@ static inline void arch_end_context_switch(struct task_struct *next) #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) { - PVOP_VCALL0(mmu.lazy_mode.enter); - - return LAZY_MMU_DEFAULT; + return PVOP_CALL0(lazy_mmu_state_t, mmu.lazy_mode.enter); } static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) { - PVOP_VCALL0(mmu.lazy_mode.leave); + PVOP_VCALL1(mmu.lazy_mode.leave, state); } static inline void arch_flush_lazy_mmu_mode(void) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index bc1af86868a3..b7c567ccbf32 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -45,8 +45,8 @@ typedef int lazy_mmu_state_t; struct pv_lazy_ops { /* Set deferred update mode, used for batching operations. 
*/ - void (*enter)(void); - void (*leave)(void); + lazy_mmu_state_t (*enter)(void); + void (*leave)(lazy_mmu_state_t); void (*flush)(void); } __no_randomize_layout; #endif diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 2039d5132ca3..6e5390ff06a5 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -2130,9 +2130,13 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot) #endif } -static void xen_enter_lazy_mmu(void) +static lazy_mmu_state_t xen_enter_lazy_mmu(void) { + if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) + return LAZY_MMU_NESTED; + enter_lazy(XEN_LAZY_MMU); + return LAZY_MMU_DEFAULT; } static void xen_flush_lazy_mmu(void) @@ -2167,11 +2171,12 @@ static void __init xen_post_allocator_init(void) pv_ops.mmu.write_cr3 = &xen_write_cr3; } -static void xen_leave_lazy_mmu(void) +static void xen_leave_lazy_mmu(lazy_mmu_state_t state) { preempt_disable(); xen_mc_flush(); - leave_lazy(XEN_LAZY_MMU); + if (state != LAZY_MMU_NESTED) + leave_lazy(XEN_LAZY_MMU); preempt_enable(); } -- 2.47.0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/7] x86/xen: support nested lazy_mmu sections (again) 2025-09-04 12:57 ` [PATCH 4/7] x86/xen: support nested lazy_mmu sections (again) Kevin Brodsky @ 2025-09-05 15:48 ` Alexander Gordeev 2025-09-08 7:32 ` Kevin Brodsky 0 siblings, 1 reply; 24+ messages in thread From: Alexander Gordeev @ 2025-09-05 15:48 UTC (permalink / raw) To: Kevin Brodsky Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On Thu, Sep 04, 2025 at 01:57:33PM +0100, Kevin Brodsky wrote: ... > -static void xen_enter_lazy_mmu(void) > +static lazy_mmu_state_t xen_enter_lazy_mmu(void) > { > + if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) > + return LAZY_MMU_NESTED; > + > enter_lazy(XEN_LAZY_MMU); > + return LAZY_MMU_DEFAULT; > } > > static void xen_flush_lazy_mmu(void) > @@ -2167,11 +2171,12 @@ static void __init xen_post_allocator_init(void) > pv_ops.mmu.write_cr3 = &xen_write_cr3; > } > > -static void xen_leave_lazy_mmu(void) > +static void xen_leave_lazy_mmu(lazy_mmu_state_t state) > { > preempt_disable(); > xen_mc_flush(); > - leave_lazy(XEN_LAZY_MMU); > + if (state != LAZY_MMU_NESTED) > + leave_lazy(XEN_LAZY_MMU); Based on xen_enter_lazy_mmu(), whether this condition needs to be executed with the preemption disabled? Or may be this_cpu_read(xen_lazy_mode) + enter_lazy(XEN_LAZY_MMU) should be executed with the preemption disabled? > preempt_enable(); > } Thanks! ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 4/7] x86/xen: support nested lazy_mmu sections (again) 2025-09-05 15:48 ` Alexander Gordeev @ 2025-09-08 7:32 ` Kevin Brodsky 0 siblings, 0 replies; 24+ messages in thread From: Kevin Brodsky @ 2025-09-08 7:32 UTC (permalink / raw) To: Alexander Gordeev Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On 05/09/2025 17:48, Alexander Gordeev wrote: > On Thu, Sep 04, 2025 at 01:57:33PM +0100, Kevin Brodsky wrote: > ... >> -static void xen_enter_lazy_mmu(void) >> +static lazy_mmu_state_t xen_enter_lazy_mmu(void) >> { >> + if (this_cpu_read(xen_lazy_mode) == XEN_LAZY_MMU) >> + return LAZY_MMU_NESTED; >> + >> enter_lazy(XEN_LAZY_MMU); >> + return LAZY_MMU_DEFAULT; >> } >> >> static void xen_flush_lazy_mmu(void) >> @@ -2167,11 +2171,12 @@ static void __init xen_post_allocator_init(void) >> pv_ops.mmu.write_cr3 = &xen_write_cr3; >> } >> >> -static void xen_leave_lazy_mmu(void) >> +static void xen_leave_lazy_mmu(lazy_mmu_state_t state) >> { >> preempt_disable(); >> xen_mc_flush(); >> - leave_lazy(XEN_LAZY_MMU); >> + if (state != LAZY_MMU_NESTED) >> + leave_lazy(XEN_LAZY_MMU); > Based on xen_enter_lazy_mmu(), whether this condition needs to be > executed with the preemption disabled? AFAIU xen_mc_flush() needs preemption to be disabled. I don't think {enter,leave}_lazy() do, but this patch doesn't introduce any change from that perspective. I suppose it doesn't hurt that xen_leave_lazy_mmu() calls leave_lazy() with preemption disabled. > Or may be this_cpu_read(xen_lazy_mode) + enter_lazy(XEN_LAZY_MMU) > should be executed with the preemption disabled? Adding another this_cpu_read(xen_lazy_mode) in xen_enter_lazy_mmu() shouldn't change the situation, i.e. preemption should still be safe. If preemption occurs in the middle of that function, xen_{start,end}_context_switch() will do the right thing to save/restore xen_lazy_mode. - Kevin ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 5/7] powerpc/mm: support nested lazy_mmu sections 2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky ` (3 preceding siblings ...) 2025-09-04 12:57 ` [PATCH 4/7] x86/xen: support nested lazy_mmu sections (again) Kevin Brodsky @ 2025-09-04 12:57 ` Kevin Brodsky 2025-09-05 15:52 ` Alexander Gordeev 2025-09-04 12:57 ` [PATCH 6/7] sparc/mm: " Kevin Brodsky ` (2 subsequent siblings) 7 siblings, 1 reply; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 12:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel The lazy_mmu API now allows nested sections to be handled by arch code: enter() can return a flag if called inside another lazy_mmu section, so that the matching call to leave() leaves any optimisation enabled. This patch implements that new logic for powerpc: if there is an active batch, then enter() returns LAZY_MMU_NESTED and the matching leave() leaves batch->active set. The preempt_{enable,disable} calls are left untouched as they already handle nesting themselves. TLB flushing is still done in leave() regardless of the nesting level, as the caller may rely on it whether nesting is occurring or not. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> --- .../powerpc/include/asm/book3s/64/tlbflush-hash.h | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h index c9f1e819e567..001c474da1fe 100644 --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h @@ -30,6 +30,7 @@ typedef int lazy_mmu_state_t; static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) { struct ppc64_tlb_batch *batch; + int lazy_mmu_nested; if (radix_enabled()) return LAZY_MMU_DEFAULT; @@ -39,9 +40,14 @@ static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) */ preempt_disable(); batch = this_cpu_ptr(&ppc64_tlb_batch); - batch->active = 1; + lazy_mmu_nested = batch->active; - return LAZY_MMU_DEFAULT; + if (!lazy_mmu_nested) { + batch->active = 1; + return LAZY_MMU_DEFAULT; + } else { + return LAZY_MMU_NESTED; + } } static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) @@ -54,7 +60,10 @@ static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) if (batch->index) __flush_tlb_pending(batch); - batch->active = 0; + + if (state != LAZY_MMU_NESTED) + batch->active = 0; + preempt_enable(); } -- 2.47.0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 5/7] powerpc/mm: support nested lazy_mmu sections 2025-09-04 12:57 ` [PATCH 5/7] powerpc/mm: support nested lazy_mmu sections Kevin Brodsky @ 2025-09-05 15:52 ` Alexander Gordeev 2025-09-08 7:32 ` Kevin Brodsky 0 siblings, 1 reply; 24+ messages in thread From: Alexander Gordeev @ 2025-09-05 15:52 UTC (permalink / raw) To: Kevin Brodsky Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On Thu, Sep 04, 2025 at 01:57:34PM +0100, Kevin Brodsky wrote: ... > static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) > { > struct ppc64_tlb_batch *batch; > + int lazy_mmu_nested; > > if (radix_enabled()) > return LAZY_MMU_DEFAULT; > @@ -39,9 +40,14 @@ static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) > */ > preempt_disable(); > batch = this_cpu_ptr(&ppc64_tlb_batch); > - batch->active = 1; > + lazy_mmu_nested = batch->active; > > - return LAZY_MMU_DEFAULT; > + if (!lazy_mmu_nested) { Why not just? if (!batch->active) { > + batch->active = 1; > + return LAZY_MMU_DEFAULT; > + } else { > + return LAZY_MMU_NESTED; > + } > } > > static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) > @@ -54,7 +60,10 @@ static inline void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) > > if (batch->index) > __flush_tlb_pending(batch); > - batch->active = 0; > + > + if (state != LAZY_MMU_NESTED) > + batch->active = 0; > + > preempt_enable(); > } Thanks! ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 5/7] powerpc/mm: support nested lazy_mmu sections 2025-09-05 15:52 ` Alexander Gordeev @ 2025-09-08 7:32 ` Kevin Brodsky 0 siblings, 0 replies; 24+ messages in thread From: Kevin Brodsky @ 2025-09-08 7:32 UTC (permalink / raw) To: Alexander Gordeev Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel On 05/09/2025 17:52, Alexander Gordeev wrote: > On Thu, Sep 04, 2025 at 01:57:34PM +0100, Kevin Brodsky wrote: > ... >> static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) >> { >> struct ppc64_tlb_batch *batch; >> + int lazy_mmu_nested; >> >> if (radix_enabled()) >> return LAZY_MMU_DEFAULT; >> @@ -39,9 +40,14 @@ static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) >> */ >> preempt_disable(); >> batch = this_cpu_ptr(&ppc64_tlb_batch); >> - batch->active = 1; >> + lazy_mmu_nested = batch->active; >> >> - return LAZY_MMU_DEFAULT; >> + if (!lazy_mmu_nested) { > Why not just? > > if (!batch->active) { Very fair question! I think the extra variable made sense in an earlier version of that patch, but now it's used only once and doesn't really improve readability either. Will remove it in v2, also in patch 6 (basically the same code). Thanks! - Kevin ^ permalink raw reply [flat|nested] 24+ messages in thread
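For reference, with the variable dropped the hash enter() would presumably end up along these lines (a sketch of the suggested simplification only, not the actual v2):

	static inline lazy_mmu_state_t arch_enter_lazy_mmu_mode(void)
	{
		struct ppc64_tlb_batch *batch;

		if (radix_enabled())
			return LAZY_MMU_DEFAULT;

		preempt_disable();
		batch = this_cpu_ptr(&ppc64_tlb_batch);
		if (batch->active)
			return LAZY_MMU_NESTED;

		batch->active = 1;
		return LAZY_MMU_DEFAULT;
	}

with the same shape applying to the sparc version in patch 6.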
* [PATCH 6/7] sparc/mm: support nested lazy_mmu sections 2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky ` (4 preceding siblings ...) 2025-09-04 12:57 ` [PATCH 5/7] powerpc/mm: support nested lazy_mmu sections Kevin Brodsky @ 2025-09-04 12:57 ` Kevin Brodsky 2025-09-04 12:57 ` [PATCH 7/7] mm: update lazy_mmu documentation Kevin Brodsky 2025-09-05 9:46 ` [PATCH 0/7] Nesting support for lazy MMU mode Alexander Gordeev 7 siblings, 0 replies; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 12:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel The lazy_mmu API now allows nested sections to be handled by arch code: enter() can return a flag if called inside another lazy_mmu section, so that the matching call to leave() leaves any optimisation enabled. This patch implements that new logic for sparc: if there is an active batch, then enter() returns LAZY_MMU_NESTED and the matching leave() leaves batch->active set. The preempt_{enable,disable} calls are left untouched as they already handle nesting themselves. TLB flushing is still done in leave() regardless of the nesting level, as the caller may rely on it whether nesting is occurring or not. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> --- arch/sparc/mm/tlb.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c index bf5094b770af..42de93d74d0e 100644 --- a/arch/sparc/mm/tlb.c +++ b/arch/sparc/mm/tlb.c @@ -53,12 +53,18 @@ void flush_tlb_pending(void) lazy_mmu_state_t arch_enter_lazy_mmu_mode(void) { struct tlb_batch *tb; + int lazy_mmu_nested; preempt_disable(); tb = this_cpu_ptr(&tlb_batch); - tb->active = 1; + lazy_mmu_nested = tb->active; - return LAZY_MMU_DEFAULT; + if (!lazy_mmu_nested) { + tb->active = 1; + return LAZY_MMU_DEFAULT; + } else { + return LAZY_MMU_NESTED; + } } void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) @@ -67,7 +73,10 @@ void arch_leave_lazy_mmu_mode(lazy_mmu_state_t state) if (tb->tlb_nr) flush_tlb_pending(); - tb->active = 0; + + if (state != LAZY_MMU_NESTED) + tb->active = 0; + preempt_enable(); } -- 2.47.0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH 7/7] mm: update lazy_mmu documentation 2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky ` (5 preceding siblings ...) 2025-09-04 12:57 ` [PATCH 6/7] sparc/mm: " Kevin Brodsky @ 2025-09-04 12:57 ` Kevin Brodsky 2025-09-05 11:13 ` Mike Rapoport 2025-09-05 9:46 ` [PATCH 0/7] Nesting support for lazy MMU mode Alexander Gordeev 7 siblings, 1 reply; 24+ messages in thread From: Kevin Brodsky @ 2025-09-04 12:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, Kevin Brodsky, Alexander Gordeev, Andreas Larsson, Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas, Christophe Leroy, Dave Hansen, David Hildenbrand, David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn, Juergen Gross, Liam R. Howlett, Lorenzo Stoakes, Madhavan Srinivasan, Michael Ellerman, Michal Hocko, Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux, xen-devel We now support nested lazy_mmu sections on all architectures implementing the API. Update the API comment accordingly. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> --- include/linux/pgtable.h | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 6932c8e344ab..be0f059beb4d 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -228,8 +228,18 @@ static inline int pmd_dirty(pmd_t pmd) * of the lazy mode. So the implementation must assume preemption may be enabled * and cpu migration is possible; it must take steps to be robust against this. * (In practice, for user PTE updates, the appropriate page table lock(s) are - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted - * and the mode cannot be used in interrupt context. + * held, but for kernel PTE updates, no lock is held). The mode cannot be used + * in interrupt context. + * + * Calls may be nested: an arch_{enter,leave}_lazy_mmu_mode() pair may be called + * while the lazy MMU mode has already been enabled. An implementation should + * handle this using the state returned by enter() and taken by the matching + * leave() call; the LAZY_MMU_{DEFAULT,NESTED} flags can be used to indicate + * whether this enter/leave pair is nested inside another or not. (It is up to + * the implementation to track whether the lazy MMU mode is enabled at any point + * in time.) The expectation is that leave() will flush any batched state + * unconditionally, but only leave the lazy MMU mode if the passed state is not + * LAZY_MMU_NESTED. */ #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE typedef int lazy_mmu_state_t; -- 2.47.0 ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 7/7] mm: update lazy_mmu documentation
2025-09-04 12:57 ` [PATCH 7/7] mm: update lazy_mmu documentation Kevin Brodsky
@ 2025-09-05 11:13 ` Mike Rapoport
0 siblings, 0 replies; 24+ messages in thread
From: Mike Rapoport @ 2025-09-05 11:13 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-mm, linux-kernel, Alexander Gordeev, Andreas Larsson,
Andrew Morton, Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Nicholas Piggin, Peter Zijlstra, Ryan Roberts, Suren Baghdasaryan,
Thomas Gleixner, Vlastimil Babka, Will Deacon, linux-arm-kernel,
linuxppc-dev, sparclinux, xen-devel

On Thu, Sep 04, 2025 at 01:57:36PM +0100, Kevin Brodsky wrote:
> We now support nested lazy_mmu sections on all architectures
> implementing the API. Update the API comment accordingly.
> 
> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  include/linux/pgtable.h | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 6932c8e344ab..be0f059beb4d 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -228,8 +228,18 @@ static inline int pmd_dirty(pmd_t pmd)
>   * of the lazy mode. So the implementation must assume preemption may be enabled
>   * and cpu migration is possible; it must take steps to be robust against this.
>   * (In practice, for user PTE updates, the appropriate page table lock(s) are
> - * held, but for kernel PTE updates, no lock is held). Nesting is not permitted
> - * and the mode cannot be used in interrupt context.
> + * held, but for kernel PTE updates, no lock is held). The mode cannot be used
> + * in interrupt context.
> + *
> + * Calls may be nested: an arch_{enter,leave}_lazy_mmu_mode() pair may be called
> + * while the lazy MMU mode has already been enabled. An implementation should
> + * handle this using the state returned by enter() and taken by the matching
> + * leave() call; the LAZY_MMU_{DEFAULT,NESTED} flags can be used to indicate
> + * whether this enter/leave pair is nested inside another or not. (It is up to
> + * the implementation to track whether the lazy MMU mode is enabled at any point
> + * in time.) The expectation is that leave() will flush any batched state
> + * unconditionally, but only leave the lazy MMU mode if the passed state is not
> + * LAZY_MMU_NESTED.
>   */
>  #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
>  typedef int lazy_mmu_state_t;
> -- 
> 2.47.0
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 0/7] Nesting support for lazy MMU mode
2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky
` (6 preceding siblings ...)
2025-09-04 12:57 ` [PATCH 7/7] mm: update lazy_mmu documentation Kevin Brodsky
@ 2025-09-05  9:46 ` Alexander Gordeev
2025-09-05 12:11 ` Kevin Brodsky
7 siblings, 1 reply; 24+ messages in thread
From: Alexander Gordeev @ 2025-09-05  9:46 UTC (permalink / raw)
To: Kevin Brodsky
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka,
Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel

On Thu, Sep 04, 2025 at 01:57:29PM +0100, Kevin Brodsky wrote:

Hi Kevin,

> When the lazy MMU mode was introduced eons ago, it wasn't made clear
> whether such a sequence was legal:
> 
> arch_enter_lazy_mmu_mode()
> ...
> arch_enter_lazy_mmu_mode()
> ...
> arch_leave_lazy_mmu_mode()
> ...
> arch_leave_lazy_mmu_mode()

I did not look too deeply - sorry if you already answered this.
Quick question - is the concern Ryan expressed addressed in the
general case?

https://lore.kernel.org/all/3cad01ea-b704-4156-807e-7a83643917a8@arm.com/

enter_lazy_mmu
for_each_pte {
	read/modify-write pte

	alloc_page
	enter_lazy_mmu
	make page valid
	exit_lazy_mmu

	write_to_page
}
exit_lazy_mmu

<quote>
This example only works because lazy_mmu doesn't support nesting. The "make page
valid" operation is completed by the time of the inner exit_lazy_mmu so that the
page can be accessed in write_to_page. If nesting was supported, the inner
exit_lazy_mmu would become a nop and write_to_page would explode.
</quote>

...

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH 0/7] Nesting support for lazy MMU mode
2025-09-05  9:46 ` [PATCH 0/7] Nesting support for lazy MMU mode Alexander Gordeev
@ 2025-09-05 12:11 ` Kevin Brodsky
0 siblings, 0 replies; 24+ messages in thread
From: Kevin Brodsky @ 2025-09-05 12:11 UTC (permalink / raw)
To: Alexander Gordeev
Cc: linux-mm, linux-kernel, Andreas Larsson, Andrew Morton,
Boris Ostrovsky, Borislav Petkov, Catalin Marinas,
Christophe Leroy, Dave Hansen, David Hildenbrand,
David S. Miller, H. Peter Anvin, Ingo Molnar, Jann Horn,
Juergen Gross, Liam R. Howlett, Lorenzo Stoakes,
Madhavan Srinivasan, Michael Ellerman, Michal Hocko,
Mike Rapoport, Nicholas Piggin, Peter Zijlstra, Ryan Roberts,
Suren Baghdasaryan, Thomas Gleixner, Vlastimil Babka,
Will Deacon, linux-arm-kernel, linuxppc-dev, sparclinux,
xen-devel

On 05/09/2025 11:46, Alexander Gordeev wrote:
> On Thu, Sep 04, 2025 at 01:57:29PM +0100, Kevin Brodsky wrote:
>
> Hi Kevin,
>
>> When the lazy MMU mode was introduced eons ago, it wasn't made clear
>> whether such a sequence was legal:
>>
>> arch_enter_lazy_mmu_mode()
>> ...
>> arch_enter_lazy_mmu_mode()
>> ...
>> arch_leave_lazy_mmu_mode()
>> ...
>> arch_leave_lazy_mmu_mode()
> I did not look too deeply - sorry if you already answered this.
> Quick question - is the concern Ryan expressed addressed in the
> general case?

The short answer is yes - it's good that you're asking because I failed
to clarify this in the cover letter!

> https://lore.kernel.org/all/3cad01ea-b704-4156-807e-7a83643917a8@arm.com/
>
> enter_lazy_mmu
> for_each_pte {
> 	read/modify-write pte
>
> 	alloc_page
> 	enter_lazy_mmu
> 	make page valid
> 	exit_lazy_mmu
>
> 	write_to_page
> }
> exit_lazy_mmu
>
> <quote>
> This example only works because lazy_mmu doesn't support nesting. The "make page
> valid" operation is completed by the time of the inner exit_lazy_mmu so that the
> page can be accessed in write_to_page. If nesting was supported, the inner
> exit_lazy_mmu would become a nop and write_to_page would explode.
> </quote>

Further down in the cover letter I refer to the approach Catalin
suggested [4]. This was in fact in response to this concern from Ryan.

The key point is: leave() keeps the lazy MMU mode enabled if it is
nested, but it flushes any batched state *unconditionally*, regardless
of nesting level.

See patches 3-6 for the practical implementation of this; patch 7 also
spells it out in the documentation.

Hope that clarifies the situation!

- Kevin

[4] https://lore.kernel.org/all/aEhKSq0zVaUJkomX@arm.com/

^ permalink raw reply	[flat|nested] 24+ messages in thread
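(To make the answer above concrete, here is a stand-alone toy model of Ryan's
scenario under the new semantics - the enter_lazy_mmu/exit_lazy_mmu names
follow the pseudocode quoted above, everything else is a stand-in rather than
kernel code. The batched "make page valid" update only takes effect when the
batch is flushed, and the inner exit_lazy_mmu flushes it even though it leaves
the mode enabled, so the subsequent write to the page is safe.)

#include <assert.h>
#include <stdio.h>

typedef int lazy_mmu_state_t;
#define LAZY_MMU_DEFAULT 0
#define LAZY_MMU_NESTED  1

/* Toy model: a "PTE update" only takes effect once the batch is flushed. */
static int active, pending_valid, page_valid;

static void flush_batch(void)
{
	if (pending_valid) {
		page_valid = 1;
		pending_valid = 0;
	}
}

static lazy_mmu_state_t enter_lazy_mmu(void)
{
	if (active)
		return LAZY_MMU_NESTED;
	active = 1;
	return LAZY_MMU_DEFAULT;
}

static void exit_lazy_mmu(lazy_mmu_state_t state)
{
	flush_batch();			/* unconditional, even when nested */
	if (state != LAZY_MMU_NESTED)
		active = 0;		/* only the outermost exit disables the mode */
}

int main(void)
{
	lazy_mmu_state_t outer = enter_lazy_mmu();

	/* alloc_page + "make page valid" inside a nested section */
	lazy_mmu_state_t inner = enter_lazy_mmu();
	pending_valid = 1;		/* batched PTE update */
	exit_lazy_mmu(inner);

	assert(page_valid);		/* write_to_page would be safe here */
	printf("page valid after inner exit_lazy_mmu: %d\n", page_valid);

	exit_lazy_mmu(outer);
	assert(!active);
	return 0;
}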
Thread overview: 24+ messages (newest: 2025-09-08 7:32 UTC)
2025-09-04 12:57 [PATCH 0/7] Nesting support for lazy MMU mode Kevin Brodsky
2025-09-04 12:57 ` [PATCH 1/7] mm: remove arch_flush_lazy_mmu_mode() Kevin Brodsky
2025-09-05 11:00 ` Mike Rapoport
2025-09-04 12:57 ` [PATCH 2/7] mm: introduce local state for lazy_mmu sections Kevin Brodsky
2025-09-04 15:06 ` Yeoreum Yun
2025-09-04 15:47 ` Kevin Brodsky
2025-09-04 17:28 ` Lorenzo Stoakes
2025-09-04 22:14 ` Kevin Brodsky
2025-09-05 11:21 ` Lorenzo Stoakes
2025-09-05 11:37 ` Lorenzo Stoakes
2025-09-05 12:22 ` Kevin Brodsky
2025-09-05 11:19 ` Mike Rapoport
2025-09-04 12:57 ` [PATCH 3/7] arm64: mm: fully support nested " Kevin Brodsky
2025-09-04 12:57 ` [PATCH 4/7] x86/xen: support nested lazy_mmu sections (again) Kevin Brodsky
2025-09-05 15:48 ` Alexander Gordeev
2025-09-08  7:32 ` Kevin Brodsky
2025-09-04 12:57 ` [PATCH 5/7] powerpc/mm: support nested lazy_mmu sections Kevin Brodsky
2025-09-05 15:52 ` Alexander Gordeev
2025-09-08  7:32 ` Kevin Brodsky
2025-09-04 12:57 ` [PATCH 6/7] sparc/mm: " Kevin Brodsky
2025-09-04 12:57 ` [PATCH 7/7] mm: update lazy_mmu documentation Kevin Brodsky
2025-09-05 11:13 ` Mike Rapoport
2025-09-05  9:46 ` [PATCH 0/7] Nesting support for lazy MMU mode Alexander Gordeev
2025-09-05 12:11 ` Kevin Brodsky