* [PATCH v2 0/6] s390/mm: Batch PTE updates in lazy MMU mode
@ 2026-04-15 15:01 Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 1/6] mm: Make lazy MMU mode context-aware Alexander Gordeev
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-04-15 15:01 UTC (permalink / raw)
To: Kevin Brodsky, David Hildenbrand, Andrew Morton, Gerald Schaefer,
Heiko Carstens, Christian Borntraeger, Vasily Gorbik,
Claudio Imbrenda
Cc: linux-s390, linux-mm, linux-kernel
Hi All!
This is v2 of the batched PTE updates in lazy MMU mode rework.
Changes since v1:
- lazy_mmu_mode_enable_pte() renamed to lazy_mmu_mode_enable_for_pte_range()
- lazy_mmu_mode_enable_for_pte_range() semantics clarified
- some sashiko review comments addressed [1], including one bug fix
- patches 2-4 added
Patches 3-6 are the actual s390 rework, while patches 1-2 are a prerequisite
that affects the generic code.
Patch 6 needs to be merged into patch 5 if deemed useful.
This series addresses an s390-specific aspect of how page table entries
are modified. In many cases, changing a valid PTE (for example, setting
or clearing a hardware bit) requires issuing an Invalidate Page Table
Entry (IPTE) instruction beforehand.
A disadvantage of the IPTE instruction is that it may initiate a
machine-wide quiesce state. This state acts as an expensive global
hardware lock and should be avoided whenever possible.
Currently, IPTE is invoked for each individual PTE update in most code
paths, and its batching capability goes unused. The instruction itself,
however, supports invalidating multiple PTEs at once, covering up to 256
entries. Using this capability can significantly reduce the number of
quiesce events, with a positive impact on overall system performance.
An effort was therefore made to identify kernel code paths that update
large numbers of consecutive PTEs. Such updates can be batched and
handled by a single IPTE invocation, leveraging the hardware support
described above.
Natural candidates for this optimization are page-table walkers that
change attributes of memory ranges and thus modify contiguous ranges
of PTEs. Many memory-management system calls enter lazy MMU mode while
updating such ranges.
This existing lazy MMU infrastructure can be leveraged to implement a
software-level batching mechanism, allowing the expensive PTE
invalidations on s390 to be coalesced.
1. https://sashiko.dev/#/patchset/cover.1774420056.git.agordeev%40linux.ibm.com
Thanks!
Alexander Gordeev (6):
mm: Make lazy MMU mode context-aware
mm/pgtable: Fix bogus comment to clear_not_present_full_ptes()
s390/mm: Complete ptep_get() conversion
s390/mm: Make PTC and UV call order consistent
s390/mm: Batch PTE updates in lazy MMU mode
s390/mm: Allow lazy MMU mode disabling
arch/s390/Kconfig | 11 +
arch/s390/boot/vmem.c | 32 +--
arch/s390/include/asm/hugetlb.h | 2 +-
arch/s390/include/asm/pgtable.h | 258 ++++++++++++++++----
arch/s390/mm/Makefile | 1 +
arch/s390/mm/hugetlbpage.c | 12 +-
arch/s390/mm/ipte_batch.c | 400 ++++++++++++++++++++++++++++++++
arch/s390/mm/pageattr.c | 42 ++--
arch/s390/mm/pgtable.c | 8 +-
arch/s390/mm/vmem.c | 82 ++++---
fs/proc/task_mmu.c | 2 +-
include/linux/pgtable.h | 51 +++-
mm/madvise.c | 8 +-
mm/memory.c | 8 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/vmalloc.c | 6 +-
17 files changed, 781 insertions(+), 146 deletions(-)
create mode 100644 arch/s390/mm/ipte_batch.c
--
2.51.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v2 1/6] mm: Make lazy MMU mode context-aware
2026-04-15 15:01 [PATCH v2 0/6] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
@ 2026-04-15 15:01 ` Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 2/6] mm/pgtable: Fix bogus comment to clear_not_present_full_ptes() Alexander Gordeev
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-04-15 15:01 UTC (permalink / raw)
To: Kevin Brodsky, David Hildenbrand, Andrew Morton, Gerald Schaefer,
Heiko Carstens, Christian Borntraeger, Vasily Gorbik,
Claudio Imbrenda
Cc: linux-s390, linux-mm, linux-kernel
Lazy MMU mode is assumed to be context-independent, in the sense
that it does not need any additional information while operating.
However, the s390 architecture benefits from knowing the exact
page table entries being modified.
Introduce lazy_mmu_mode_enable_for_pte_range(), which is provided
with the process address space and the page table being operated on.
This information is required to enable s390-specific optimizations.
The function takes parameters that are typically passed to page-
table level walkers, which implies that the span of PTE entries
never crosses a page table boundary.
Architectures that do not require such information simply do not
need to define the lazy_mmu_mode_enable_for_pte_range() callback.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
fs/proc/task_mmu.c | 2 +-
include/linux/pgtable.h | 47 +++++++++++++++++++++++++++++++++++++++++
mm/madvise.c | 8 +++----
mm/memory.c | 8 +++----
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/vmalloc.c | 6 +++---
7 files changed, 61 insertions(+), 14 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e091931d7ca1..799db0d7ec8b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2752,7 +2752,7 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
return 0;
}
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(vma->vm_mm, start, end, start_pte);
if ((p->arg.flags & PM_SCAN_WP_MATCHING) && !p->vec_out) {
/* Fast path for performing exclusive WP */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a50df42a893f..9ff7b78d65b1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -271,6 +271,51 @@ static inline void lazy_mmu_mode_enable(void)
arch_enter_lazy_mmu_mode();
}
+#ifndef arch_enter_lazy_mmu_mode_for_pte_range
+static inline void arch_enter_lazy_mmu_mode_for_pte_range(struct mm_struct *mm,
+ unsigned long addr, unsigned long end, pte_t *ptep)
+{
+ arch_enter_lazy_mmu_mode();
+}
+#endif
+
+/**
+ * lazy_mmu_mode_enable_for_pte_range() - Enable the lazy MMU mode with a speedup hint.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Start address of the range.
+ * @end: End address of the range.
+ * @ptep: Page table pointer for the first entry.
+ *
+ * Enters a new lazy MMU mode section; if the mode was not already enabled,
+ * enables it and calls arch_enter_lazy_mmu_mode_for_pte_range().
+ *
+ * PTEs that fall within the specified range might observe update speedups.
+ * The PTE range must belong to the specified memory space and not cross
+ * a page table boundary.
+ *
+ * There are no requirements on the order or range completeness of PTE
+ * updates for the specified range.
+ *
+ * Must be paired with a call to lazy_mmu_mode_disable().
+ *
+ * Has no effect if called:
+ * - While paused - see lazy_mmu_mode_pause()
+ * - In interrupt context
+ */
+static inline void lazy_mmu_mode_enable_for_pte_range(struct mm_struct *mm,
+ unsigned long addr, unsigned long end, pte_t *ptep)
+{
+ struct lazy_mmu_state *state = &current->lazy_mmu_state;
+
+ if (in_interrupt() || state->pause_count > 0)
+ return;
+
+ VM_WARN_ON_ONCE(state->enable_count == U8_MAX);
+
+ if (state->enable_count++ == 0)
+ arch_enter_lazy_mmu_mode_for_pte_range(mm, addr, end, ptep);
+}
+
/**
* lazy_mmu_mode_disable() - Disable the lazy MMU mode.
*
@@ -353,6 +398,8 @@ static inline void lazy_mmu_mode_resume(void)
}
#else
static inline void lazy_mmu_mode_enable(void) {}
+static inline void lazy_mmu_mode_enable_for_pte_range(struct mm_struct *mm,
+ unsigned long addr, unsigned long end, pte_t *ptep) {}
static inline void lazy_mmu_mode_disable(void) {}
static inline void lazy_mmu_mode_pause(void) {}
static inline void lazy_mmu_mode_resume(void) {}
diff --git a/mm/madvise.c b/mm/madvise.c
index dbb69400786d..7faac3a627ff 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -451,7 +451,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!start_pte)
return 0;
flush_tlb_batched_pending(mm);
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(mm, addr, end, start_pte);
for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
nr = 1;
ptent = ptep_get(pte);
@@ -506,7 +506,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
if (!start_pte)
break;
flush_tlb_batched_pending(mm);
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(mm, addr, end, start_pte);
if (!err)
nr = 0;
continue;
@@ -673,7 +673,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!start_pte)
return 0;
flush_tlb_batched_pending(mm);
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(mm, addr, end, start_pte);
for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
nr = 1;
ptent = ptep_get(pte);
@@ -733,7 +733,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!start_pte)
break;
flush_tlb_batched_pending(mm);
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(mm, addr, end, pte);
if (!err)
nr = 0;
continue;
diff --git a/mm/memory.c b/mm/memory.c
index c65e82c86fed..4c0f266df92a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1269,7 +1269,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
orig_src_pte = src_pte;
orig_dst_pte = dst_pte;
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(src_mm, addr, end, src_pte);
do {
nr = 1;
@@ -1917,7 +1917,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
return addr;
flush_tlb_batched_pending(mm);
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(mm, addr, end, start_pte);
do {
bool any_skipped = false;
@@ -2875,7 +2875,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
mapped_pte = pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
return -ENOMEM;
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(mm, addr, end, mapped_pte);
do {
BUG_ON(!pte_none(ptep_get(pte)));
if (!pfn_modify_allowed(pfn, prot)) {
@@ -3235,7 +3235,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
return -EINVAL;
}
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(mm, addr, end, mapped_pte);
if (fn) {
do {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c0571445bef7..a7bfb4516dc5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -233,7 +233,7 @@ static long change_pte_range(struct mmu_gather *tlb,
is_private_single_threaded = vma_is_single_threaded_private(vma);
flush_tlb_batched_pending(vma->vm_mm);
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(vma->vm_mm, addr, end, pte);
do {
nr_ptes = 1;
oldpte = ptep_get(pte);
diff --git a/mm/mremap.c b/mm/mremap.c
index 2be876a70cc0..16320242da51 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -260,7 +260,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
flush_tlb_batched_pending(vma->vm_mm);
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(mm, old_addr, old_end, old_ptep);
for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 61caa55a4402..35a23044a969 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -108,7 +108,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
if (!pte)
return -ENOMEM;
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(&init_mm, addr, end, pte);
do {
if (unlikely(!pte_none(ptep_get(pte)))) {
@@ -371,7 +371,7 @@ static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
unsigned long size = PAGE_SIZE;
pte = pte_offset_kernel(pmd, addr);
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(&init_mm, addr, end, pte);
do {
#ifdef CONFIG_HUGETLB_PAGE
@@ -538,7 +538,7 @@ static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
if (!pte)
return -ENOMEM;
- lazy_mmu_mode_enable();
+ lazy_mmu_mode_enable_for_pte_range(&init_mm, addr, end, pte);
do {
struct page *page = pages[*nr];
--
2.51.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v2 2/6] mm/pgtable: Fix bogus comment to clear_not_present_full_ptes()
2026-04-15 15:01 [PATCH v2 0/6] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 1/6] mm: Make lazy MMU mode context-aware Alexander Gordeev
@ 2026-04-15 15:01 ` Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 3/6] s390/mm: Complete ptep_get() conversion Alexander Gordeev
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-04-15 15:01 UTC (permalink / raw)
To: Kevin Brodsky, David Hildenbrand, Andrew Morton, Gerald Schaefer,
Heiko Carstens, Christian Borntraeger, Vasily Gorbik,
Claudio Imbrenda
Cc: linux-s390, linux-mm, linux-kernel
The address provided to clear_not_present_full_ptes() is the
address of the underlying memory, not the address of the first PTE.
The exact wording is taken from the clear_ptes() comment.
Suggested-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
include/linux/pgtable.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9ff7b78d65b1..2b82a71f13d7 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1021,8 +1021,8 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
/**
* clear_not_present_full_ptes - Clear multiple not present PTEs which are
* consecutive in the pgtable.
- * @mm: Address space the ptes represent.
- * @addr: Address of the first pte.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
* @ptep: Page table pointer for the first entry.
* @nr: Number of entries to clear.
* @full: Whether we are clearing a full mm.
--
2.51.0
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v2 3/6] s390/mm: Complete ptep_get() conversion
2026-04-15 15:01 [PATCH v2 0/6] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 1/6] mm: Make lazy MMU mode context-aware Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 2/6] mm/pgtable: Fix bogus comment to clear_not_present_full_ptes() Alexander Gordeev
@ 2026-04-15 15:01 ` Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 4/6] s390/mm: Make PTC and UV call order consistent Alexander Gordeev
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-04-15 15:01 UTC (permalink / raw)
To: Kevin Brodsky, David Hildenbrand, Andrew Morton, Gerald Schaefer,
Heiko Carstens, Christian Borntraeger, Vasily Gorbik,
Claudio Imbrenda
Cc: linux-s390, linux-mm, linux-kernel
Finalize commit c33c794828f2 ("mm: ptep_get() conversion") and
replace direct page table entry dereferencing with the proper
accessors (ptep_get(), pmdp_get(), etc.).
Override the default getter implementations even though they are
currently identical: pud_clear(), p4d_clear(), and pgd_clear()
require corresponding architecture-specific getters, but these
are not yet defined. This avoids a dependency loop.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
arch/s390/boot/vmem.c | 32 +++++++------
arch/s390/include/asm/hugetlb.h | 2 +-
arch/s390/include/asm/pgtable.h | 60 ++++++++++++++++++------
arch/s390/mm/hugetlbpage.c | 12 ++---
arch/s390/mm/pageattr.c | 42 +++++++++--------
arch/s390/mm/vmem.c | 82 ++++++++++++++++++---------------
6 files changed, 138 insertions(+), 92 deletions(-)
diff --git a/arch/s390/boot/vmem.c b/arch/s390/boot/vmem.c
index 7d6cc4c85af0..ff6d58a476ba 100644
--- a/arch/s390/boot/vmem.c
+++ b/arch/s390/boot/vmem.c
@@ -338,7 +338,7 @@ static void pgtable_pte_populate(pmd_t *pmd, unsigned long addr, unsigned long e
pte = pte_offset_kernel(pmd, addr);
for (; addr < end; addr += PAGE_SIZE, pte++) {
- if (pte_none(*pte)) {
+ if (pte_none(ptep_get(pte))) {
if (kasan_pte_populate_zero_shadow(pte, mode))
continue;
entry = __pte(resolve_pa_may_alloc(addr, PAGE_SIZE, mode));
@@ -355,26 +355,27 @@ static void pgtable_pmd_populate(pud_t *pud, unsigned long addr, unsigned long e
enum populate_mode mode)
{
unsigned long pa, next, pages = 0;
- pmd_t *pmd, entry;
+ pmd_t *pmd, entry, large_entry;
pte_t *pte;
pmd = pmd_offset(pud, addr);
for (; addr < end; addr = next, pmd++) {
next = pmd_addr_end(addr, end);
- if (pmd_none(*pmd)) {
+ entry = pmdp_get(pmd);
+ if (pmd_none(entry)) {
if (kasan_pmd_populate_zero_shadow(pmd, addr, next, mode))
continue;
pa = try_get_large_pmd_pa(pmd, addr, next, mode);
if (pa != INVALID_PHYS_ADDR) {
- entry = __pmd(pa);
- entry = set_pmd_bit(entry, SEGMENT_KERNEL);
- set_pmd(pmd, entry);
+ large_entry = __pmd(pa);
+ large_entry = set_pmd_bit(large_entry, SEGMENT_KERNEL);
+ set_pmd(pmd, large_entry);
pages++;
continue;
}
pte = boot_pte_alloc();
pmd_populate(&init_mm, pmd, pte);
- } else if (pmd_leaf(*pmd)) {
+ } else if (pmd_leaf(entry)) {
continue;
}
pgtable_pte_populate(pmd, addr, next, mode);
@@ -387,26 +388,27 @@ static void pgtable_pud_populate(p4d_t *p4d, unsigned long addr, unsigned long e
enum populate_mode mode)
{
unsigned long pa, next, pages = 0;
- pud_t *pud, entry;
+ pud_t *pud, entry, large_entry;
pmd_t *pmd;
pud = pud_offset(p4d, addr);
for (; addr < end; addr = next, pud++) {
next = pud_addr_end(addr, end);
- if (pud_none(*pud)) {
+ entry = pudp_get(pud);
+ if (pud_none(entry)) {
if (kasan_pud_populate_zero_shadow(pud, addr, next, mode))
continue;
pa = try_get_large_pud_pa(pud, addr, next, mode);
if (pa != INVALID_PHYS_ADDR) {
- entry = __pud(pa);
- entry = set_pud_bit(entry, REGION3_KERNEL);
- set_pud(pud, entry);
+ large_entry = __pud(pa);
+ large_entry = set_pud_bit(large_entry, REGION3_KERNEL);
+ set_pud(pud, large_entry);
pages++;
continue;
}
pmd = boot_crst_alloc(_SEGMENT_ENTRY_EMPTY);
pud_populate(&init_mm, pud, pmd);
- } else if (pud_leaf(*pud)) {
+ } else if (pud_leaf(entry)) {
continue;
}
pgtable_pmd_populate(pud, addr, next, mode);
@@ -425,7 +427,7 @@ static void pgtable_p4d_populate(pgd_t *pgd, unsigned long addr, unsigned long e
p4d = p4d_offset(pgd, addr);
for (; addr < end; addr = next, p4d++) {
next = p4d_addr_end(addr, end);
- if (p4d_none(*p4d)) {
+ if (p4d_none(p4dp_get(p4d))) {
if (kasan_p4d_populate_zero_shadow(p4d, addr, next, mode))
continue;
pud = boot_crst_alloc(_REGION3_ENTRY_EMPTY);
@@ -451,7 +453,7 @@ static void pgtable_populate(unsigned long addr, unsigned long end, enum populat
pgd = pgd_offset(&init_mm, addr);
for (; addr < end; addr = next, pgd++) {
next = pgd_addr_end(addr, end);
- if (pgd_none(*pgd)) {
+ if (pgd_none(pgdp_get(pgd))) {
if (kasan_pgd_populate_zero_shadow(pgd, addr, next, mode))
continue;
p4d = boot_crst_alloc(_REGION2_ENTRY_EMPTY);
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 6983e52eaf81..e33a5b587ee4 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -41,7 +41,7 @@ static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned long sz)
{
- if ((pte_val(*ptep) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
+ if ((pte_val(ptep_get(ptep)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
set_pte(ptep, __pte(_REGION3_ENTRY_EMPTY));
else
set_pte(ptep, __pte(_SEGMENT_ENTRY_EMPTY));
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 67f5df20a57e..42688ea4337f 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -983,22 +983,39 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
WRITE_ONCE(*ptep, pte);
}
-static inline void pgd_clear(pgd_t *pgd)
+#define ptep_get ptep_get
+static inline pte_t ptep_get(pte_t *ptep)
{
- if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R1)
- set_pgd(pgd, __pgd(_REGION1_ENTRY_EMPTY));
+ return READ_ONCE(*ptep);
}
-static inline void p4d_clear(p4d_t *p4d)
+#define pmdp_get pmdp_get
+static inline pmd_t pmdp_get(pmd_t *pmdp)
{
- if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R2)
- set_p4d(p4d, __p4d(_REGION2_ENTRY_EMPTY));
+ return READ_ONCE(*pmdp);
}
-static inline void pud_clear(pud_t *pud)
+#define pudp_get pudp_get
+static inline pud_t pudp_get(pud_t *pudp)
{
- if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
- set_pud(pud, __pud(_REGION3_ENTRY_EMPTY));
+ return READ_ONCE(*pudp);
+}
+
+#define p4dp_get p4dp_get
+static inline p4d_t p4dp_get(p4d_t *p4dp)
+{
+ return READ_ONCE(*p4dp);
+}
+
+#define pgdp_get pgdp_get
+static inline pgd_t pgdp_get(pgd_t *pgdp)
+{
+ return READ_ONCE(*pgdp);
+}
+
+static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+{
+ set_pte(ptep, __pte(_PAGE_INVALID));
}
static inline void pmd_clear(pmd_t *pmdp)
@@ -1006,9 +1023,22 @@ static inline void pmd_clear(pmd_t *pmdp)
set_pmd(pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
}
-static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+static inline void pud_clear(pud_t *pud)
{
- set_pte(ptep, __pte(_PAGE_INVALID));
+ if ((pud_val(pudp_get(pud)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
+ set_pud(pud, __pud(_REGION3_ENTRY_EMPTY));
+}
+
+static inline void p4d_clear(p4d_t *p4d)
+{
+ if ((p4d_val(p4dp_get(p4d)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R2)
+ set_p4d(p4d, __p4d(_REGION2_ENTRY_EMPTY));
+}
+
+static inline void pgd_clear(pgd_t *pgd)
+{
+ if ((pgd_val(pgdp_get(pgd)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R1)
+ set_pgd(pgd, __pgd(_REGION1_ENTRY_EMPTY));
}
/*
@@ -1169,7 +1199,7 @@ pte_t ptep_xchg_lazy(struct mm_struct *, unsigned long, pte_t *, pte_t);
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
- pte_t pte = *ptep;
+ pte_t pte = ptep_get(ptep);
pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
return pte_young(pte);
@@ -1230,7 +1260,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
pte_t res;
if (full) {
- res = *ptep;
+ res = ptep_get(ptep);
set_pte(ptep, __pte(_PAGE_INVALID));
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
@@ -1262,7 +1292,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- pte_t pte = *ptep;
+ pte_t pte = ptep_get(ptep);
if (pte_write(pte))
ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte));
@@ -1298,7 +1328,7 @@ static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
* PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
* A local RDP can be used to do the flush.
*/
- if (cpu_has_rdp() && !(pte_val(*ptep) & _PAGE_PROTECT))
+ if (cpu_has_rdp() && !(pte_val(ptep_get(ptep)) & _PAGE_PROTECT))
__ptep_rdp(address, ptep, 1);
}
#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 302ef5781b65..db35d8fe8609 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -143,7 +143,7 @@ void __set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
rste = __pte_to_rste(pte);
/* Set correct table type for 2G hugepages */
- if ((pte_val(*ptep) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3) {
+ if ((pte_val(ptep_get(ptep)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3) {
if (likely(pte_present(pte)))
rste |= _REGION3_ENTRY_LARGE;
rste |= _REGION_ENTRY_TYPE_R3;
@@ -161,7 +161,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
- return __rste_to_pte(pte_val(*ptep));
+ return __rste_to_pte(pte_val(ptep_get(ptep)));
}
pte_t __huge_ptep_get_and_clear(struct mm_struct *mm,
@@ -171,7 +171,7 @@ pte_t __huge_ptep_get_and_clear(struct mm_struct *mm,
pmd_t *pmdp = (pmd_t *) ptep;
pud_t *pudp = (pud_t *) ptep;
- if ((pte_val(*ptep) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
+ if ((pte_val(ptep_get(ptep)) & _REGION_ENTRY_TYPE_MASK) == _REGION_ENTRY_TYPE_R3)
pudp_xchg_direct(mm, addr, pudp, __pud(_REGION3_ENTRY_EMPTY));
else
pmdp_xchg_direct(mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
@@ -209,13 +209,13 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
pmd_t *pmdp = NULL;
pgdp = pgd_offset(mm, addr);
- if (pgd_present(*pgdp)) {
+ if (pgd_present(pgdp_get(pgdp))) {
p4dp = p4d_offset(pgdp, addr);
- if (p4d_present(*p4dp)) {
+ if (p4d_present(p4dp_get(p4dp))) {
pudp = pud_offset(p4dp, addr);
if (sz == PUD_SIZE)
return (pte_t *)pudp;
- if (pud_present(*pudp))
+ if (pud_present(pudp_get(pudp)))
pmdp = pmd_offset(pudp, addr);
}
}
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index bb29c38ae624..3a54860cb05f 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -85,7 +85,7 @@ static int walk_pte_level(pmd_t *pmdp, unsigned long addr, unsigned long end,
return 0;
ptep = pte_offset_kernel(pmdp, addr);
do {
- new = *ptep;
+ new = ptep_get(ptep);
if (pte_none(new))
return -EINVAL;
if (flags & SET_MEMORY_RO)
@@ -114,15 +114,16 @@ static int split_pmd_page(pmd_t *pmdp, unsigned long addr)
{
unsigned long pte_addr, prot;
pte_t *pt_dir, *ptep;
- pmd_t new;
+ pmd_t new, pmd;
int i, ro, nx;
pt_dir = vmem_pte_alloc();
if (!pt_dir)
return -ENOMEM;
- pte_addr = pmd_pfn(*pmdp) << PAGE_SHIFT;
- ro = !!(pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT);
- nx = !!(pmd_val(*pmdp) & _SEGMENT_ENTRY_NOEXEC);
+ pmd = pmdp_get(pmdp);
+ pte_addr = pmd_pfn(pmd) << PAGE_SHIFT;
+ ro = !!(pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT);
+ nx = !!(pmd_val(pmd) & _SEGMENT_ENTRY_NOEXEC);
prot = pgprot_val(ro ? PAGE_KERNEL_RO : PAGE_KERNEL);
if (!nx)
prot &= ~_PAGE_NOEXEC;
@@ -142,7 +143,7 @@ static int split_pmd_page(pmd_t *pmdp, unsigned long addr)
static void modify_pmd_page(pmd_t *pmdp, unsigned long addr,
unsigned long flags)
{
- pmd_t new = *pmdp;
+ pmd_t new = pmdp_get(pmdp);
if (flags & SET_MEMORY_RO)
new = pmd_wrprotect(new);
@@ -165,16 +166,17 @@ static int walk_pmd_level(pud_t *pudp, unsigned long addr, unsigned long end,
unsigned long flags)
{
unsigned long next;
+ pmd_t *pmdp, pmd;
int need_split;
- pmd_t *pmdp;
int rc = 0;
pmdp = pmd_offset(pudp, addr);
do {
- if (pmd_none(*pmdp))
+ pmd = pmdp_get(pmdp);
+ if (pmd_none(pmd))
return -EINVAL;
next = pmd_addr_end(addr, end);
- if (pmd_leaf(*pmdp)) {
+ if (pmd_leaf(pmd)) {
need_split = !!(flags & SET_MEMORY_4K);
need_split |= !!(addr & ~PMD_MASK);
need_split |= !!(addr + PMD_SIZE > next);
@@ -201,15 +203,16 @@ int split_pud_page(pud_t *pudp, unsigned long addr)
{
unsigned long pmd_addr, prot;
pmd_t *pm_dir, *pmdp;
- pud_t new;
+ pud_t new, pud;
int i, ro, nx;
pm_dir = vmem_crst_alloc(_SEGMENT_ENTRY_EMPTY);
if (!pm_dir)
return -ENOMEM;
- pmd_addr = pud_pfn(*pudp) << PAGE_SHIFT;
- ro = !!(pud_val(*pudp) & _REGION_ENTRY_PROTECT);
- nx = !!(pud_val(*pudp) & _REGION_ENTRY_NOEXEC);
+ pud = pudp_get(pudp);
+ pmd_addr = pud_pfn(pud) << PAGE_SHIFT;
+ ro = !!(pud_val(pud) & _REGION_ENTRY_PROTECT);
+ nx = !!(pud_val(pud) & _REGION_ENTRY_NOEXEC);
prot = pgprot_val(ro ? SEGMENT_KERNEL_RO : SEGMENT_KERNEL);
if (!nx)
prot &= ~_SEGMENT_ENTRY_NOEXEC;
@@ -229,7 +232,7 @@ int split_pud_page(pud_t *pudp, unsigned long addr)
static void modify_pud_page(pud_t *pudp, unsigned long addr,
unsigned long flags)
{
- pud_t new = *pudp;
+ pud_t new = pudp_get(pudp);
if (flags & SET_MEMORY_RO)
new = pud_wrprotect(new);
@@ -252,16 +255,17 @@ static int walk_pud_level(p4d_t *p4d, unsigned long addr, unsigned long end,
unsigned long flags)
{
unsigned long next;
+ pud_t *pudp, pud;
int need_split;
- pud_t *pudp;
int rc = 0;
pudp = pud_offset(p4d, addr);
do {
- if (pud_none(*pudp))
+ pud = pudp_get(pudp);
+ if (pud_none(pud))
return -EINVAL;
next = pud_addr_end(addr, end);
- if (pud_leaf(*pudp)) {
+ if (pud_leaf(pud)) {
need_split = !!(flags & SET_MEMORY_4K);
need_split |= !!(addr & ~PUD_MASK);
need_split |= !!(addr + PUD_SIZE > next);
@@ -291,7 +295,7 @@ static int walk_p4d_level(pgd_t *pgd, unsigned long addr, unsigned long end,
p4dp = p4d_offset(pgd, addr);
do {
- if (p4d_none(*p4dp))
+ if (p4d_none(p4dp_get(p4dp)))
return -EINVAL;
next = p4d_addr_end(addr, end);
rc = walk_pud_level(p4dp, addr, next, flags);
@@ -313,7 +317,7 @@ static int change_page_attr(unsigned long addr, unsigned long end,
pgdp = pgd_offset_k(addr);
do {
- if (pgd_none(*pgdp))
+ if (pgd_none(pgdp_get(pgdp)))
break;
next = pgd_addr_end(addr, end);
rc = walk_p4d_level(pgdp, addr, next, flags);
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index eeadff45e0e1..803099f3db73 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -171,18 +171,19 @@ static int __ref modify_pte_table(pmd_t *pmd, unsigned long addr,
{
unsigned long prot, pages = 0;
int ret = -ENOMEM;
- pte_t *pte;
+ pte_t *pte, entry;
prot = pgprot_val(PAGE_KERNEL);
pte = pte_offset_kernel(pmd, addr);
for (; addr < end; addr += PAGE_SIZE, pte++) {
+ entry = ptep_get(pte);
if (!add) {
- if (pte_none(*pte))
+ if (pte_none(entry))
continue;
if (!direct)
- vmem_free_pages((unsigned long)pfn_to_virt(pte_pfn(*pte)), get_order(PAGE_SIZE), altmap);
+ vmem_free_pages((unsigned long)pfn_to_virt(pte_pfn(entry)), get_order(PAGE_SIZE), altmap);
pte_clear(&init_mm, addr, pte);
- } else if (pte_none(*pte)) {
+ } else if (pte_none(entry)) {
if (!direct) {
void *new_page = vmemmap_alloc_block_buf(PAGE_SIZE, NUMA_NO_NODE, altmap);
@@ -212,10 +213,10 @@ static void try_free_pte_table(pmd_t *pmd, unsigned long start)
/* We can safely assume this is fully in 1:1 mapping & vmemmap area */
pte = pte_offset_kernel(pmd, start);
for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
- if (!pte_none(*pte))
+ if (!pte_none(ptep_get(pte)))
return;
}
- vmem_pte_free((unsigned long *) pmd_deref(*pmd));
+ vmem_pte_free((unsigned long *)pmd_deref(pmdp_get(pmd)));
pmd_clear(pmd);
}
@@ -226,6 +227,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
{
unsigned long next, prot, pages = 0;
int ret = -ENOMEM;
+ pmd_t entry;
pmd_t *pmd;
pte_t *pte;
@@ -233,23 +235,24 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
pmd = pmd_offset(pud, addr);
for (; addr < end; addr = next, pmd++) {
next = pmd_addr_end(addr, end);
+ entry = pmdp_get(pmd);
if (!add) {
- if (pmd_none(*pmd))
+ if (pmd_none(entry))
continue;
- if (pmd_leaf(*pmd)) {
+ if (pmd_leaf(entry)) {
if (IS_ALIGNED(addr, PMD_SIZE) &&
IS_ALIGNED(next, PMD_SIZE)) {
if (!direct)
- vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE), altmap);
+ vmem_free_pages(pmd_deref(entry), get_order(PMD_SIZE), altmap);
pmd_clear(pmd);
pages++;
} else if (!direct && vmemmap_unuse_sub_pmd(addr, next)) {
- vmem_free_pages(pmd_deref(*pmd), get_order(PMD_SIZE), altmap);
+ vmem_free_pages(pmd_deref(entry), get_order(PMD_SIZE), altmap);
pmd_clear(pmd);
}
continue;
}
- } else if (pmd_none(*pmd)) {
+ } else if (pmd_none(entry)) {
if (IS_ALIGNED(addr, PMD_SIZE) &&
IS_ALIGNED(next, PMD_SIZE) &&
cpu_has_edat1() && direct &&
@@ -281,7 +284,7 @@ static int __ref modify_pmd_table(pud_t *pud, unsigned long addr,
if (!pte)
goto out;
pmd_populate(&init_mm, pmd, pte);
- } else if (pmd_leaf(*pmd)) {
+ } else if (pmd_leaf(entry)) {
if (!direct)
vmemmap_use_sub_pmd(addr, next);
continue;
@@ -306,9 +309,9 @@ static void try_free_pmd_table(pud_t *pud, unsigned long start)
pmd = pmd_offset(pud, start);
for (i = 0; i < PTRS_PER_PMD; i++, pmd++)
- if (!pmd_none(*pmd))
+ if (!pmd_none(pmdp_get(pmd)))
return;
- vmem_free_pages(pud_deref(*pud), CRST_ALLOC_ORDER, NULL);
+ vmem_free_pages(pud_deref(pudp_get(pud)), CRST_ALLOC_ORDER, NULL);
pud_clear(pud);
}
@@ -317,21 +320,22 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
{
unsigned long next, prot, pages = 0;
int ret = -ENOMEM;
- pud_t *pud;
+ pud_t *pud, entry;
pmd_t *pmd;
prot = pgprot_val(REGION3_KERNEL);
pud = pud_offset(p4d, addr);
for (; addr < end; addr = next, pud++) {
next = pud_addr_end(addr, end);
+ entry = pudp_get(pud);
if (!add) {
- if (pud_none(*pud))
+ if (pud_none(entry))
continue;
- if (pud_leaf(*pud)) {
+ if (pud_leaf(entry)) {
if (IS_ALIGNED(addr, PUD_SIZE) &&
IS_ALIGNED(next, PUD_SIZE)) {
if (!direct)
- vmem_free_pages(pud_deref(*pud), get_order(PUD_SIZE), altmap);
+ vmem_free_pages(pud_deref(entry), get_order(PUD_SIZE), altmap);
pud_clear(pud);
pages++;
continue;
@@ -339,7 +343,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
split_pud_page(pud, addr & PUD_MASK);
}
}
- } else if (pud_none(*pud)) {
+ } else if (pud_none(entry)) {
if (IS_ALIGNED(addr, PUD_SIZE) &&
IS_ALIGNED(next, PUD_SIZE) &&
cpu_has_edat2() && direct &&
@@ -352,7 +356,7 @@ static int modify_pud_table(p4d_t *p4d, unsigned long addr, unsigned long end,
if (!pmd)
goto out;
pud_populate(&init_mm, pud, pmd);
- } else if (pud_leaf(*pud)) {
+ } else if (pud_leaf(entry)) {
continue;
}
ret = modify_pmd_table(pud, addr, next, add, direct, altmap);
@@ -375,10 +379,10 @@ static void try_free_pud_table(p4d_t *p4d, unsigned long start)
pud = pud_offset(p4d, start);
for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
- if (!pud_none(*pud))
+ if (!pud_none(pudp_get(pud)))
return;
}
- vmem_free_pages(p4d_deref(*p4d), CRST_ALLOC_ORDER, NULL);
+ vmem_free_pages(p4d_deref(p4dp_get(p4d)), CRST_ALLOC_ORDER, NULL);
p4d_clear(p4d);
}
@@ -387,16 +391,17 @@ static int modify_p4d_table(pgd_t *pgd, unsigned long addr, unsigned long end,
{
unsigned long next;
int ret = -ENOMEM;
- p4d_t *p4d;
+ p4d_t *p4d, entry;
pud_t *pud;
p4d = p4d_offset(pgd, addr);
for (; addr < end; addr = next, p4d++) {
next = p4d_addr_end(addr, end);
+ entry = p4dp_get(p4d);
if (!add) {
- if (p4d_none(*p4d))
+ if (p4d_none(entry))
continue;
- } else if (p4d_none(*p4d)) {
+ } else if (p4d_none(entry)) {
pud = vmem_crst_alloc(_REGION3_ENTRY_EMPTY);
if (!pud)
goto out;
@@ -420,10 +425,10 @@ static void try_free_p4d_table(pgd_t *pgd, unsigned long start)
p4d = p4d_offset(pgd, start);
for (i = 0; i < PTRS_PER_P4D; i++, p4d++) {
- if (!p4d_none(*p4d))
+ if (!p4d_none(p4dp_get(p4d)))
return;
}
- vmem_free_pages(pgd_deref(*pgd), CRST_ALLOC_ORDER, NULL);
+ vmem_free_pages(pgd_deref(pgdp_get(pgd)), CRST_ALLOC_ORDER, NULL);
pgd_clear(pgd);
}
@@ -432,7 +437,7 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
{
unsigned long addr, next;
int ret = -ENOMEM;
- pgd_t *pgd;
+ pgd_t *pgd, entry;
p4d_t *p4d;
if (WARN_ON_ONCE(!PAGE_ALIGNED(start | end)))
@@ -449,11 +454,12 @@ static int modify_pagetable(unsigned long start, unsigned long end, bool add,
for (addr = start; addr < end; addr = next) {
next = pgd_addr_end(addr, end);
pgd = pgd_offset_k(addr);
+ entry = pgdp_get(pgd);
if (!add) {
- if (pgd_none(*pgd))
+ if (pgd_none(entry))
continue;
- } else if (pgd_none(*pgd)) {
+ } else if (pgd_none(entry)) {
p4d = vmem_crst_alloc(_REGION2_ENTRY_EMPTY);
if (!p4d)
goto out;
@@ -575,6 +581,8 @@ int vmem_add_mapping(unsigned long start, unsigned long size)
pte_t *vmem_get_alloc_pte(unsigned long addr, bool alloc)
{
pte_t *ptep = NULL;
+ pud_t pud_entry;
+ pmd_t pmd_entry;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -582,7 +590,7 @@ pte_t *vmem_get_alloc_pte(unsigned long addr, bool alloc)
pte_t *pte;
pgd = pgd_offset_k(addr);
- if (pgd_none(*pgd)) {
+ if (pgd_none(pgdp_get(pgd))) {
if (!alloc)
goto out;
p4d = vmem_crst_alloc(_REGION2_ENTRY_EMPTY);
@@ -591,7 +599,7 @@ pte_t *vmem_get_alloc_pte(unsigned long addr, bool alloc)
pgd_populate(&init_mm, pgd, p4d);
}
p4d = p4d_offset(pgd, addr);
- if (p4d_none(*p4d)) {
+ if (p4d_none(p4dp_get(p4d))) {
if (!alloc)
goto out;
pud = vmem_crst_alloc(_REGION3_ENTRY_EMPTY);
@@ -600,25 +608,27 @@ pte_t *vmem_get_alloc_pte(unsigned long addr, bool alloc)
p4d_populate(&init_mm, p4d, pud);
}
pud = pud_offset(p4d, addr);
- if (pud_none(*pud)) {
+ pud_entry = pudp_get(pud);
+ if (pud_none(pud_entry)) {
if (!alloc)
goto out;
pmd = vmem_crst_alloc(_SEGMENT_ENTRY_EMPTY);
if (!pmd)
goto out;
pud_populate(&init_mm, pud, pmd);
- } else if (WARN_ON_ONCE(pud_leaf(*pud))) {
+ } else if (WARN_ON_ONCE(pud_leaf(pud_entry))) {
goto out;
}
pmd = pmd_offset(pud, addr);
- if (pmd_none(*pmd)) {
+ pmd_entry = pmdp_get(pmd);
+ if (pmd_none(pmd_entry)) {
if (!alloc)
goto out;
pte = vmem_pte_alloc();
if (!pte)
goto out;
pmd_populate(&init_mm, pmd, pte);
- } else if (WARN_ON_ONCE(pmd_leaf(*pmd))) {
+ } else if (WARN_ON_ONCE(pmd_leaf(pmd_entry))) {
goto out;
}
ptep = pte_offset_kernel(pmd, addr);
--
2.51.0
* [PATCH v2 4/6] s390/mm: Make PTC and UV call order consistent
2026-04-15 15:01 [PATCH v2 0/6] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
` (2 preceding siblings ...)
2026-04-15 15:01 ` [PATCH v2 3/6] s390/mm: Complete ptep_get() conversion Alexander Gordeev
@ 2026-04-15 15:01 ` Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 5/6] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 6/6] s390/mm: Allow lazy MMU mode disabling Alexander Gordeev
5 siblings, 0 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-04-15 15:01 UTC (permalink / raw)
To: Kevin Brodsky, David Hildenbrand, Andrew Morton, Gerald Schaefer,
Heiko Carstens, Christian Borntraeger, Vasily Gorbik,
Claudio Imbrenda
Cc: linux-s390, linux-mm, linux-kernel
In various code paths, page_table_check_pte_clear() is called
before converting a secure page, while in others it is called
after. Make this consistent and always perform the conversion
after the PTC hook has been called. Also make all conversion-eligibility
condition checks look the same, and rework the one
in ptep_get_and_clear_full() slightly.
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
arch/s390/include/asm/pgtable.h | 39 +++++++++++++++------------------
1 file changed, 18 insertions(+), 21 deletions(-)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 42688ea4337f..010a33fec867 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1219,10 +1219,10 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
pte_t res;
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+ page_table_check_pte_clear(mm, addr, res);
/* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
WARN_ON_ONCE(uv_convert_from_secure_pte(res));
- page_table_check_pte_clear(mm, addr, res);
return res;
}
@@ -1238,10 +1238,10 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
pte_t res;
res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
+ page_table_check_pte_clear(vma->vm_mm, addr, res);
/* At this point the reference through the mapping is still present */
if (mm_is_protected(vma->vm_mm) && pte_present(res))
WARN_ON_ONCE(uv_convert_from_secure_pte(res));
- page_table_check_pte_clear(vma->vm_mm, addr, res);
return res;
}
@@ -1265,26 +1265,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
-
page_table_check_pte_clear(mm, addr, res);
-
- /* Nothing to do */
- if (!mm_is_protected(mm) || !pte_present(res))
- return res;
- /*
- * At this point the reference through the mapping is still present.
- * The notifier should have destroyed all protected vCPUs at this
- * point, so the destroy should be successful.
- */
- if (full && !uv_destroy_pte(res))
- return res;
- /*
- * If something went wrong and the page could not be destroyed, or
- * if this is not a mm teardown, the slower export is used as
- * fallback instead. If even that fails, print a warning and leak
- * the page, to avoid crashing the whole system.
- */
- WARN_ON_ONCE(uv_convert_from_secure_pte(res));
+ /* At this point the reference through the mapping is still present */
+ if (mm_is_protected(mm) && pte_present(res)) {
+ /*
+ * The notifier should have destroyed all protected vCPUs at
+ * this point, so the destroy should be successful.
+ */
+ if (full && !uv_destroy_pte(res))
+ return res;
+ /*
+ * If something went wrong and the page could not be destroyed,
+ * or if this is not a mm teardown, the slower export is used
+ * as fallback instead. If even that fails, print a warning and
+ * leak the page, to avoid crashing the whole system.
+ */
+ WARN_ON_ONCE(uv_convert_from_secure_pte(res));
+ }
return res;
}
--
2.51.0
* [PATCH v2 5/6] s390/mm: Batch PTE updates in lazy MMU mode
2026-04-15 15:01 [PATCH v2 0/6] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
` (3 preceding siblings ...)
2026-04-15 15:01 ` [PATCH v2 4/6] s390/mm: Make PTC and UV call order consistent Alexander Gordeev
@ 2026-04-15 15:01 ` Alexander Gordeev
2026-04-15 15:01 ` [PATCH v2 6/6] s390/mm: Allow lazy MMU mode disabling Alexander Gordeev
5 siblings, 0 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-04-15 15:01 UTC (permalink / raw)
To: Kevin Brodsky, David Hildenbrand, Andrew Morton, Gerald Schaefer,
Heiko Carstens, Christian Borntraeger, Vasily Gorbik,
Claudio Imbrenda
Cc: linux-s390, linux-mm, linux-kernel
Make use of the IPTE instruction's "Additional Entries" field to
invalidate multiple PTEs in one go while in lazy MMU mode. This
is the mode in which many memory-management system calls (like
mremap(), mprotect(), etc.) update memory attributes.
To achieve that, the set_pte() and ptep_get() primitives store and
retrieve PTE values in a per-CPU cache; the cached values are applied
to the real page table once lazy MMU mode is left.
The same is done for memory-management platform callbacks that
would otherwise cause intense per-PTE IPTE traffic, reducing the
number of IPTE instructions from up to PTRS_PER_PTE to a single
instruction in the best case. The average reduction is of course
smaller.
Since all existing page table iterators called in lazy MMU mode
handle one table at a time, the per-CPU cache does not need to be
larger than PTRS_PER_PTE entries. That also naturally aligns with
the IPTE instruction, which must not cross a page table boundary.
Before this change, the system calls did:
lazy_mmu_mode_enable_for_pte_range()
...
<update PTEs> // up to PTRS_PER_PTE single-IPTEs
...
lazy_mmu_mode_disable()
With this change, the system calls do:
lazy_mmu_mode_enable_for_pte_range()
...
<store new PTE values in the per-CPU cache>
...
lazy_mmu_mode_disable() // apply cache with one multi-IPTE
When applied to large memory ranges, some system calls show
significant speedups:
mprotect() ~15x
munmap() ~3x
mremap() ~28x
At the same time, fork() shows a measurable slowdown of ~1.5x.
The overall results depend on memory size and access patterns,
but the change generally does not degrade performance.
In addition to a process-wide impact, the rework affects the
whole Central Electronics Complex (CEC). Each (global) IPTE
instruction initiates a quiesce state in a CEC, so reducing
the number of IPTE calls relieves CEC-wide quiesce traffic.
In an extreme case of mprotect() continuously triggering the
quiesce state on four LPARs in parallel, measurements show
~25x fewer quiesce events.
Assisted-by: gemini:gemini-3.1-pro-preview
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
arch/s390/Kconfig | 11 +
arch/s390/include/asm/pgtable.h | 169 ++++++++++++--
arch/s390/mm/Makefile | 1 +
arch/s390/mm/ipte_batch.c | 378 ++++++++++++++++++++++++++++++++
arch/s390/mm/pgtable.c | 8 +-
5 files changed, 545 insertions(+), 22 deletions(-)
create mode 100644 arch/s390/mm/ipte_batch.c
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 7828fbe0fc42..deffc819268e 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -732,6 +732,17 @@ config MAX_PHYSMEM_BITS
Increasing the number of bits also increases the kernel image size.
By default 46 bits (64TB) are supported.
+config IPTE_BATCH
+ def_bool y
+ prompt "Enable Additional Entries for the IPTE instruction"
+ select ARCH_HAS_LAZY_MMU_MODE
+ help
+ This option enables use of the "Additional Entries" field of the IPTE
+ instruction, which builds on the lazy MMU mode infrastructure.
+ As a result, multiple PTEs are invalidated in one go. That improves
+ performance of many memory-management system calls (like mremap(),
+ mprotect(), etc.) and decreases CEC-wide quiesce traffic.
+
endmenu
menu "I/O subsystem"
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 010a33fec867..39c5d672cf09 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -39,6 +39,71 @@ enum {
extern atomic_long_t direct_pages_count[PG_DIRECT_MAP_MAX];
+#if !defined(CONFIG_IPTE_BATCH) || defined(__DECOMPRESSOR)
+static inline
+bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ int *res)
+{
+ return false;
+}
+
+static inline
+bool ipte_batch_ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, pte_t *res)
+{
+ return false;
+}
+
+static inline
+bool ipte_batch_ptep_modify_prot_start(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, pte_t *res)
+{
+ return false;
+}
+
+static inline
+bool ipte_batch_ptep_modify_prot_commit(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t old_pte, pte_t pte)
+{
+ return false;
+}
+
+static inline
+bool ipte_batch_ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ return false;
+}
+
+static inline bool ipte_batch_set_pte(pte_t *ptep, pte_t pte)
+{
+ return false;
+}
+
+static inline bool ipte_batch_ptep_get(pte_t *ptep, pte_t *res)
+{
+ return false;
+}
+#else
+bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ int *res);
+bool ipte_batch_ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, pte_t *res);
+bool ipte_batch_ptep_modify_prot_start(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, pte_t *res);
+bool ipte_batch_ptep_modify_prot_commit(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t old_pte, pte_t pte);
+
+bool ipte_batch_ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep);
+bool ipte_batch_set_pte(pte_t *ptep, pte_t pte);
+bool ipte_batch_ptep_get(pte_t *ptep, pte_t *res);
+#endif
+
static inline void update_page_count(int level, long count)
{
if (IS_ENABLED(CONFIG_PROC_FS))
@@ -978,15 +1043,30 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
WRITE_ONCE(*pmdp, pmd);
}
-static inline void set_pte(pte_t *ptep, pte_t pte)
+static inline void __set_pte(pte_t *ptep, pte_t pte)
{
WRITE_ONCE(*ptep, pte);
}
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+ if (!ipte_batch_set_pte(ptep, pte))
+ __set_pte(ptep, pte);
+}
+
+static inline pte_t __ptep_get(pte_t *ptep)
+{
+ return READ_ONCE(*ptep);
+}
+
#define ptep_get ptep_get
static inline pte_t ptep_get(pte_t *ptep)
{
- return READ_ONCE(*ptep);
+ pte_t res;
+
+ if (!ipte_batch_ptep_get(ptep, &res))
+ res = __ptep_get(ptep);
+ return res;
}
#define pmdp_get pmdp_get
@@ -1179,6 +1259,20 @@ static __always_inline void __ptep_ipte_range(unsigned long address, int nr,
} while (nr != 255);
}
+#ifdef CONFIG_IPTE_BATCH
+void arch_enter_lazy_mmu_mode_for_pte_range(struct mm_struct *mm,
+ unsigned long addr, unsigned long end,
+ pte_t *pte);
+#define arch_enter_lazy_mmu_mode_for_pte_range arch_enter_lazy_mmu_mode_for_pte_range
+
+static inline void arch_enter_lazy_mmu_mode(void)
+{
+}
+
+void arch_leave_lazy_mmu_mode(void);
+void arch_flush_lazy_mmu_mode(void);
+#endif
+
/*
* This is hard to understand. ptep_get_and_clear and ptep_clear_flush
* both clear the TLB for the unmapped pte. The reason is that
@@ -1199,10 +1293,16 @@ pte_t ptep_xchg_lazy(struct mm_struct *, unsigned long, pte_t *, pte_t);
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
{
- pte_t pte = ptep_get(ptep);
+ pte_t pte;
+ int res;
- pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
- return pte_young(pte);
+ if (!ipte_batch_ptep_test_and_clear_young(vma, addr, ptep, &res)) {
+ pte = __ptep_get(ptep);
+ pte = pte_mkold(pte);
+ pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte);
+ res = pte_young(pte);
+ }
+ return res;
}
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1218,7 +1318,8 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
{
pte_t res;
- res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+ if (!ipte_batch_ptep_get_and_clear(mm, addr, ptep, &res))
+ res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
page_table_check_pte_clear(mm, addr, res);
/* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
@@ -1227,9 +1328,34 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
}
#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
-pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
-void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
- pte_t *, pte_t, pte_t);
+pte_t ___ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
+void ___ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
+ pte_t *, pte_t, pte_t);
+
+static inline
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+ pte_t res;
+
+ if (!ipte_batch_ptep_modify_prot_start(vma, addr, ptep, &res))
+ res = ___ptep_modify_prot_start(vma, addr, ptep);
+ return res;
+}
+
+static inline
+void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t old_pte, pte_t pte)
+{
+ if (!ipte_batch_ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte))
+ ___ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
+}
+
#define __HAVE_ARCH_PTEP_CLEAR_FLUSH
static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
@@ -1259,11 +1385,13 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
{
pte_t res;
- if (full) {
- res = ptep_get(ptep);
- set_pte(ptep, __pte(_PAGE_INVALID));
- } else {
- res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+ if (!ipte_batch_ptep_get_and_clear(mm, addr, ptep, &res)) {
+ if (full) {
+ res = __ptep_get(ptep);
+ __set_pte(ptep, __pte(_PAGE_INVALID));
+ } else {
+ res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+ }
}
page_table_check_pte_clear(mm, addr, res);
/* At this point the reference through the mapping is still present */
@@ -1289,10 +1417,15 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
{
- pte_t pte = ptep_get(ptep);
+ pte_t pte;
- if (pte_write(pte))
- ptep_xchg_lazy(mm, addr, ptep, pte_wrprotect(pte));
+ if (!ipte_batch_ptep_set_wrprotect(mm, addr, ptep)) {
+ pte = __ptep_get(ptep);
+ if (pte_write(pte)) {
+ pte = pte_wrprotect(pte);
+ ptep_xchg_lazy(mm, addr, ptep, pte);
+ }
+ }
}
/*
@@ -1325,7 +1458,7 @@ static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
* PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
* A local RDP can be used to do the flush.
*/
- if (cpu_has_rdp() && !(pte_val(ptep_get(ptep)) & _PAGE_PROTECT))
+ if (cpu_has_rdp() && !(pte_val(__ptep_get(ptep)) & _PAGE_PROTECT))
__ptep_rdp(address, ptep, 1);
}
#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index 193899c39ca7..0f6c6de447d4 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -11,5 +11,6 @@ obj-$(CONFIG_DEBUG_VIRTUAL) += physaddr.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_PTDUMP) += dump_pagetables.o
obj-$(CONFIG_PFAULT) += pfault.o
+obj-$(CONFIG_IPTE_BATCH) += ipte_batch.o
obj-$(subst m,y,$(CONFIG_KVM)) += gmap_helpers.o
diff --git a/arch/s390/mm/ipte_batch.c b/arch/s390/mm/ipte_batch.c
new file mode 100644
index 000000000000..cc4c50347d0f
--- /dev/null
+++ b/arch/s390/mm/ipte_batch.c
@@ -0,0 +1,378 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/pgtable.h>
+#include <asm/facility.h>
+#include <kunit/visibility.h>
+
+#define PTE_POISON 0
+
+struct ipte_batch {
+ struct mm_struct *mm;
+ unsigned long base_addr;
+ unsigned long base_end;
+ pte_t *base_pte;
+ pte_t *start_pte;
+ pte_t *end_pte;
+ pte_t cache[PTRS_PER_PTE];
+};
+
+static DEFINE_PER_CPU(struct ipte_batch, ipte_range);
+
+static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
+{
+ pte_t pte = __ptep_get(start);
+ pte_t *ptep;
+
+ *valid = !(pte_val(pte) & _PAGE_INVALID);
+
+ for (ptep = start + 1; ptep < end; ptep++) {
+ pte = __ptep_get(ptep);
+ if (*valid) {
+ if (pte_val(pte) & _PAGE_INVALID)
+ break;
+ } else {
+ if (!(pte_val(pte) & _PAGE_INVALID))
+ break;
+ }
+ }
+
+ return ptep - start;
+}
+
+static void __invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
+ int nr_ptes, pte_t *ptep)
+{
+ atomic_inc(&mm->context.flush_count);
+ if (cpu_has_tlb_lc() &&
+ cpumask_equal(mm_cpumask(mm), cpumask_of(smp_processor_id())))
+ __ptep_ipte_range(addr, nr_ptes - 1, ptep, IPTE_LOCAL);
+ else
+ __ptep_ipte_range(addr, nr_ptes - 1, ptep, IPTE_GLOBAL);
+ atomic_dec(&mm->context.flush_count);
+}
+
+static int invalidate_pte_range(struct mm_struct *mm, unsigned long addr,
+ pte_t *start, pte_t *end)
+{
+ int nr_ptes;
+ bool valid;
+
+ nr_ptes = count_contiguous(start, end, &valid);
+ if (valid)
+ __invalidate_pte_range(mm, addr, nr_ptes, start);
+
+ return nr_ptes;
+}
+
+static void set_pte_range(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t *end, pte_t *cache)
+{
+ int i, nr_ptes;
+
+ while (ptep < end) {
+ nr_ptes = invalidate_pte_range(mm, addr, ptep, end);
+
+ for (i = 0; i < nr_ptes; i++, ptep++, cache++) {
+ __set_pte(ptep, *cache);
+ *cache = __pte(PTE_POISON);
+ }
+
+ addr += nr_ptes * PAGE_SIZE;
+ }
+}
+
+static void enter_ipte_batch(struct mm_struct *mm,
+ unsigned long addr, unsigned long end, pte_t *pte)
+{
+ struct ipte_batch *ib;
+
+ ib = &get_cpu_var(ipte_range);
+
+ ib->mm = mm;
+ ib->base_addr = addr;
+ ib->base_end = end;
+ ib->base_pte = pte;
+}
+
+static void leave_ipte_batch(void)
+{
+ pte_t *ptep, *start, *start_cache, *cache;
+ unsigned long start_addr, addr;
+ struct ipte_batch *ib;
+ int start_idx;
+
+ ib = &get_cpu_var(ipte_range);
+ if (!ib->mm) {
+ put_cpu_var(ipte_range);
+ return;
+ }
+
+ lockdep_assert_preemption_disabled();
+ if (!ib->start_pte)
+ goto done;
+
+ start = ib->start_pte;
+ start_idx = ib->start_pte - ib->base_pte;
+ start_addr = ib->base_addr + start_idx * PAGE_SIZE;
+ addr = start_addr;
+ start_cache = &ib->cache[start_idx];
+ cache = start_cache;
+ for (ptep = start; ptep < ib->end_pte; ptep++, cache++, addr += PAGE_SIZE) {
+ if (pte_val(*cache) == PTE_POISON) {
+ if (start) {
+ set_pte_range(ib->mm, start_addr, start, ptep, start_cache);
+ start = NULL;
+ }
+ } else if (!start) {
+ start = ptep;
+ start_addr = addr;
+ start_cache = cache;
+ }
+ }
+ if (start)
+ set_pte_range(ib->mm, start_addr, start, ptep, start_cache);
+
+ ib->start_pte = NULL;
+ ib->end_pte = NULL;
+
+done:
+ ib->mm = NULL;
+ ib->base_addr = 0;
+ ib->base_end = 0;
+ ib->base_pte = NULL;
+
+ put_cpu_var(ipte_range);
+}
+
+static void flush_lazy_mmu_mode(void)
+{
+ unsigned long addr, end;
+ struct ipte_batch *ib;
+ struct mm_struct *mm;
+ pte_t *pte;
+
+ ib = &get_cpu_var(ipte_range);
+ if (ib->mm) {
+ mm = ib->mm;
+ addr = ib->base_addr;
+ end = ib->base_end;
+ pte = ib->base_pte;
+
+ leave_ipte_batch();
+ enter_ipte_batch(mm, addr, end, pte);
+ }
+ put_cpu_var(ipte_range);
+}
+
+void arch_enter_lazy_mmu_mode_for_pte_range(struct mm_struct *mm,
+ unsigned long addr, unsigned long end,
+ pte_t *pte)
+{
+ if (!test_facility(13))
+ return;
+ enter_ipte_batch(mm, addr, end, pte);
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_enter_lazy_mmu_mode_for_pte_range);
+
+void arch_leave_lazy_mmu_mode(void)
+{
+ if (!test_facility(13))
+ return;
+ leave_ipte_batch();
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_leave_lazy_mmu_mode);
+
+void arch_flush_lazy_mmu_mode(void)
+{
+ if (!test_facility(13))
+ return;
+ flush_lazy_mmu_mode();
+}
+EXPORT_SYMBOL_IF_KUNIT(arch_flush_lazy_mmu_mode);
+
+static void __ipte_batch_set_pte(struct ipte_batch *ib, pte_t *ptep, pte_t pte)
+{
+ unsigned int idx = ptep - ib->base_pte;
+
+ lockdep_assert_preemption_disabled();
+ ib->cache[idx] = pte;
+
+ if (!ib->start_pte) {
+ ib->start_pte = ptep;
+ ib->end_pte = ptep + 1;
+ } else if (ptep < ib->start_pte) {
+ ib->start_pte = ptep;
+ } else if (ptep + 1 > ib->end_pte) {
+ ib->end_pte = ptep + 1;
+ }
+}
+
+static pte_t __ipte_batch_ptep_get(struct ipte_batch *ib, pte_t *ptep)
+{
+ unsigned int idx = ptep - ib->base_pte;
+
+ lockdep_assert_preemption_disabled();
+ if (pte_val(ib->cache[idx]) == PTE_POISON)
+ return __ptep_get(ptep);
+ return ib->cache[idx];
+}
+
+static bool lazy_mmu_mode(struct ipte_batch *ib, struct mm_struct *mm, pte_t *ptep)
+{
+ unsigned int nr_ptes;
+
+ lockdep_assert_preemption_disabled();
+ if (!is_lazy_mmu_mode_active())
+ return false;
+ if (!mm)
+ return false;
+ if (!ib->mm)
+ return false;
+ if (ptep < ib->base_pte)
+ return false;
+ nr_ptes = (ib->base_end - ib->base_addr) / PAGE_SIZE;
+ if (ptep >= ib->base_pte + nr_ptes)
+ return false;
+ return true;
+}
+
+static struct ipte_batch *get_ipte_batch_nomm(pte_t *ptep)
+{
+ struct ipte_batch *ib;
+
+ ib = &get_cpu_var(ipte_range);
+ if (!lazy_mmu_mode(ib, ib->mm, ptep)) {
+ put_cpu_var(ipte_range);
+ return NULL;
+ }
+
+ return ib;
+}
+
+static struct ipte_batch *get_ipte_batch(struct mm_struct *mm, pte_t *ptep)
+{
+ struct ipte_batch *ib;
+
+ ib = &get_cpu_var(ipte_range);
+ if (!lazy_mmu_mode(ib, mm, ptep)) {
+ put_cpu_var(ipte_range);
+ return NULL;
+ }
+
+ return ib;
+}
+
+static void put_ipte_batch(struct ipte_batch *ib)
+{
+ put_cpu_var(ipte_range);
+}
+
+bool ipte_batch_set_pte(pte_t *ptep, pte_t pte)
+{
+ struct ipte_batch *ib;
+
+ ib = get_ipte_batch_nomm(ptep);
+ if (!ib)
+ return false;
+ __ipte_batch_set_pte(ib, ptep, pte);
+ put_ipte_batch(ib);
+
+ return true;
+}
+
+bool ipte_batch_ptep_get(pte_t *ptep, pte_t *res)
+{
+ struct ipte_batch *ib;
+
+ ib = get_ipte_batch_nomm(ptep);
+ if (!ib)
+ return false;
+ *res = __ipte_batch_ptep_get(ib, ptep);
+ put_ipte_batch(ib);
+
+ return true;
+}
+
+bool ipte_batch_ptep_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ int *res)
+{
+ struct ipte_batch *ib;
+ pte_t pte, old;
+
+ ib = get_ipte_batch(vma->vm_mm, ptep);
+ if (!ib)
+ return false;
+
+ old = __ipte_batch_ptep_get(ib, ptep);
+ pte = pte_mkold(old);
+ __ipte_batch_set_pte(ib, ptep, pte);
+
+ put_ipte_batch(ib);
+
+ *res = pte_young(old);
+
+ return true;
+}
+
+bool ipte_batch_ptep_get_and_clear(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep, pte_t *res)
+{
+ struct ipte_batch *ib;
+ pte_t pte, old;
+
+ ib = get_ipte_batch(mm, ptep);
+ if (!ib)
+ return false;
+
+ old = __ipte_batch_ptep_get(ib, ptep);
+ pte = __pte(_PAGE_INVALID);
+ __ipte_batch_set_pte(ib, ptep, pte);
+
+ put_ipte_batch(ib);
+
+ *res = old;
+
+ return true;
+}
+
+bool ipte_batch_ptep_modify_prot_start(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, pte_t *res)
+{
+ return ipte_batch_ptep_get_and_clear(vma->vm_mm, addr, ptep, res);
+}
+
+bool ipte_batch_ptep_modify_prot_commit(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ pte_t old_pte, pte_t pte)
+{
+ struct ipte_batch *ib;
+
+ ib = get_ipte_batch(vma->vm_mm, ptep);
+ if (!ib)
+ return false;
+ __ipte_batch_set_pte(ib, ptep, pte);
+ put_ipte_batch(ib);
+
+ return true;
+}
+
+bool ipte_batch_ptep_set_wrprotect(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
+{
+ struct ipte_batch *ib;
+ pte_t pte;
+
+ ib = get_ipte_batch(mm, ptep);
+ if (!ib)
+ return false;
+
+ pte = __ipte_batch_ptep_get(ib, ptep);
+ if (pte_write(pte)) {
+ pte = pte_wrprotect(pte);
+ __ipte_batch_set_pte(ib, ptep, pte);
+ }
+
+ put_ipte_batch(ib);
+
+ return true;
+}
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 4acd8b140c4b..df36523bcbbb 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -166,14 +166,14 @@ pte_t ptep_xchg_lazy(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL(ptep_xchg_lazy);
-pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
- pte_t *ptep)
+pte_t ___ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep)
{
return ptep_flush_lazy(vma->vm_mm, addr, ptep, 1);
}
-void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
- pte_t *ptep, pte_t old_pte, pte_t pte)
+void ___ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
+ pte_t *ptep, pte_t old_pte, pte_t pte)
{
set_pte(ptep, pte);
}
--
2.51.0
* [PATCH v2 6/6] s390/mm: Allow lazy MMU mode disabling
2026-04-15 15:01 [PATCH v2 0/6] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
` (4 preceding siblings ...)
2026-04-15 15:01 ` [PATCH v2 5/6] s390/mm: Batch PTE updates in lazy MMU mode Alexander Gordeev
@ 2026-04-15 15:01 ` Alexander Gordeev
5 siblings, 0 replies; 7+ messages in thread
From: Alexander Gordeev @ 2026-04-15 15:01 UTC (permalink / raw)
To: Kevin Brodsky, David Hildenbrand, Andrew Morton, Gerald Schaefer,
Heiko Carstens, Christian Borntraeger, Vasily Gorbik,
Claudio Imbrenda
Cc: linux-s390, linux-mm, linux-kernel
Introduce a "lazy_mmu" kernel command line parameter
to allow disabling lazy MMU mode at boot.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
arch/s390/mm/ipte_batch.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/arch/s390/mm/ipte_batch.c b/arch/s390/mm/ipte_batch.c
index cc4c50347d0f..cd86daeba7ec 100644
--- a/arch/s390/mm/ipte_batch.c
+++ b/arch/s390/mm/ipte_batch.c
@@ -16,6 +16,22 @@ struct ipte_batch {
};
static DEFINE_PER_CPU(struct ipte_batch, ipte_range);
+static DEFINE_STATIC_KEY_TRUE_RO(lazy_mmu);
+
+static int __init setup_lazy_mmu(char *str)
+{
+ bool enable;
+
+ if (kstrtobool(str, &enable)) {
+ pr_warn("Invalid lazy_mmu parameter, keeping lazy MMU mode enabled\n");
+ } else if (!enable) {
+ pr_warn("Disabling lazy MMU mode\n");
+ static_key_disable(&lazy_mmu.key);
+ }
+
+ return 0;
+}
+early_param("lazy_mmu", setup_lazy_mmu);
static int count_contiguous(pte_t *start, pte_t *end, bool *valid)
{
@@ -169,6 +185,8 @@ void arch_enter_lazy_mmu_mode_for_pte_range(struct mm_struct *mm,
{
if (!test_facility(13))
return;
+ if (!static_branch_likely(&lazy_mmu))
+ return;
enter_ipte_batch(mm, addr, end, pte);
}
EXPORT_SYMBOL_IF_KUNIT(arch_enter_lazy_mmu_mode_for_pte_range);
@@ -177,6 +195,8 @@ void arch_leave_lazy_mmu_mode(void)
{
if (!test_facility(13))
return;
+ if (!static_branch_likely(&lazy_mmu))
+ return;
leave_ipte_batch();
}
EXPORT_SYMBOL_IF_KUNIT(arch_leave_lazy_mmu_mode);
@@ -185,6 +205,8 @@ void arch_flush_lazy_mmu_mode(void)
{
if (!test_facility(13))
return;
+ if (!static_branch_likely(&lazy_mmu))
+ return;
flush_lazy_mmu_mode();
}
EXPORT_SYMBOL_IF_KUNIT(arch_flush_lazy_mmu_mode);
--
2.51.0