linux-mm.kvack.org archive mirror
* [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages
@ 2025-12-15 20:49 Ankur Arora
  2025-12-15 20:49 ` [PATCH v10 1/8] treewide: provide a generic clear_user_page() variant Ankur Arora
                   ` (8 more replies)
  0 siblings, 9 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

[ Resending with the list this time. Some of the recipients might have
  received a duplicate copy of this email. Apologies. ]

This series adds clearing of contiguous page ranges for hugepages.

Major change over v9:
  - move clear_user_page() and clear_user_pages() into highmem.h and
    condition both on the arch not defining clear_user_highpage().

Described in more detail in the changelog.

The series improves on the current page-at-a-time approach in two ways:

 - amortizes the per-page setup cost over a larger extent
 - when using string instructions, exposes the real region size
   to the processor.

A processor could use knowledge of the full extent to optimize the
clearing better than if it sees only a single page sized extent at
a time. AMD Zen uarchs, as an example, elide cacheline allocation
for regions larger than LLC-size.
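
As an illustration (a simplified sketch, not code from this series;
clear_page()/clear_pages() are the single-page and ranged primitives
discussed below):

	/* Page-at-a-time: the CPU only ever sees PAGE_SIZE-sized stores. */
	static void zero_extent_page_at_a_time(void *addr, unsigned int npages)
	{
		unsigned int i;

		for (i = 0; i < npages; i++)
			clear_page(addr + i * PAGE_SIZE);
	}

	/* Ranged: the CPU sees the full extent and can optimize accordingly. */
	static void zero_extent_ranged(void *addr, unsigned int npages)
	{
		clear_pages(addr, npages);
	}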

Demand faulting a 64GB region shows performance improvement:

 $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

                       baseline              +series             change

                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

   pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*

   pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05%       +  5.2%	preempt=none|voluntary
   pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21% [#]   +144.3%	preempt=full|lazy

 [#] Notice that we perform much better with preempt=full|lazy. That's
  because preemptible models don't need explicit invocations of
  cond_resched() to ensure reasonable preemption latency, which
  allows us to clear the full extent (1GB) in a single unit.
  In comparison, the maximum extent used for preempt=none|voluntary is
  PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).

  The larger extent allows the processor to elide cacheline
  allocation (on Milan the threshold is LLC-size=32MB.) 

  (The hope is that eventually the lazy preemption model will be able
  to do the same job that the none and voluntary models are used for
  today, allowing cond_resched() to go away.)
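
  (A rough worked example of the batch-size tradeoff, using the ~10 GB/s
  clearing bandwidth assumed when sizing the batch in patch 7: an 8MB
  unit bounds the gap between cond_resched() checks at roughly
  8MB / 10GB/s ~= 0.8ms, whereas clearing a full 1GB extent without a
  check could take on the order of 100ms.)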

The anon-w-seq test in the vm-scalability benchmark, however, does show
worse performance with utime increasing by ~9%:

                         stime                  utime

  baseline         1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
  +series          1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )

In part this is because anon-w-seq runs with 384 processes zeroing
anonymously mapped memory which they then access sequentially. As
such this is likely an uncommon pattern, where the memory bandwidth
is saturated while we are also cache limited because the workload
accesses the entire region.

Raghavendra also tested a previous version of the series on AMD Genoa
and also sees improvement [1] with preempt=lazy.
(The pg-sz=2MB improvement is much higher on Genoa than what I see on
Milan):

  $ perf bench mem mmap -p $page-size -f populate -s 64GB -l 10

                    base               patched              change
   pg-sz=2MB       12.731939 GB/sec    26.304263 GB/sec     106.6%
   pg-sz=1GB       26.232423 GB/sec    61.174836 GB/sec     133.2%


Changelog:

v10:
 - Condition the definition of clear_user_page(), clear_user_pages()
   on the architecture not defining clear_user_highpage(). This
   sidesteps issues with architectures that do not define
   clear_user_page() but do define clear_user_highpage().

   Also, instead of splitting them up across files, move them to
   linux/highmem.h. This gets rid of build errors when using
   clear_user_pages() on architectures that use macro magic (such as
   sparc, m68k).
   (Suggested by Christophe Leroy).

 - flesh out some of the comments around the x86 clear_pages()
   definition (Suggested by Borislav Petkov and Mateusz Guzik).
 (https://lore.kernel.org/lkml/20251121202352.494700-1-ankur.a.arora@oracle.com/)

v9:
 - Define PROCESS_PAGES_NON_PREEMPT_BATCH in common code (instead of
   inheriting ARCH_PAGE_CONTIG_NR.)
    - Also document this in much greater detail since clearing pages
      needing a constant dependent on the preemption model is, on its
      face, quite odd.
     (Suggested by David Hildenbrand, Andrew Morton, Borislav Petkov.)

 - Switch architectural markers from __HAVE_ARCH_CLEAR_USER_PAGE (and
   similar) to clear_user_page etc. (Suggested by David Hildenbrand)

 - s/memzero_page_aligned_unrolled/__clear_pages_unrolled/
   (Suggested by Borislav Petkov.)
 - style, comment fixes
 (https://lore.kernel.org/lkml/20251027202109.678022-1-ankur.a.arora@oracle.com/)

v8:
 - make clear_user_highpages(), clear_user_pages() and clear_pages()
   more robust across architectures. (Thanks David!)
 - split up folio_zero_user() changes into ones for clearing contiguous
   regions and those for maintaining temporal locality since they have
   different performance profiles (Suggested by Andrew Morton.)
 - added Raghavendra's Reviewed-by, Tested-by.
 - get rid of nth_page()
 - perf related patches have been pulled already. Remove them.

v7:
 - interface cleanups, comments for clear_user_highpages(), clear_user_pages(),
   clear_pages().
 - fixed build errors flagged by kernel test robot
 (https://lore.kernel.org/lkml/20250917152418.4077386-1-ankur.a.arora@oracle.com/)

v6:
 - perf bench mem: update man pages and other cleanups (Namhyung Kim)
 - unify folio_zero_user() for HIGHMEM, !HIGHMEM options instead of
   working through a new config option (David Hildenbrand).
    - cleanups and simplification around that.
 (https://lore.kernel.org/lkml/20250902080816.3715913-1-ankur.a.arora@oracle.com/)

v5:
 - move the non HIGHMEM implementation of folio_zero_user() from x86
   to common code (Dave Hansen)
 - Minor naming cleanups, commit messages etc
 (https://lore.kernel.org/lkml/20250710005926.1159009-1-ankur.a.arora@oracle.com/)

v4:
 - adds perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
 - inline stosb etc (PeterZ)
 - handle cooperative preemption models (Ingo)
 - interface and other cleanups all over (Ingo)
 (https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)

v3:
 - get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
   was limited to preempt=full|lazy.
 - override folio_zero_user() (Linus)
 (https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)

v2:
 - addressed review comments from peterz, tglx.
 - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
 - General code cleanup
 (https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages.v7

[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/


Ankur Arora (7):
  highmem: introduce clear_user_highpages()
  mm: introduce clear_pages() and clear_user_pages()
  highmem: do range clearing in clear_user_highpages()
  x86/mm: Simplify clear_page_*
  x86/clear_page: Introduce clear_pages()
  mm, folio_zero_user: support clearing page ranges
  mm: folio_zero_user: cache neighbouring pages

David Hildenbrand (1):
  treewide: provide a generic clear_user_page() variant

 arch/alpha/include/asm/page.h      |  1 -
 arch/arc/include/asm/page.h        |  2 +
 arch/arm/include/asm/page-nommu.h  |  1 -
 arch/arm64/include/asm/page.h      |  1 -
 arch/csky/abiv1/inc/abi/page.h     |  1 +
 arch/csky/abiv2/inc/abi/page.h     |  7 ---
 arch/hexagon/include/asm/page.h    |  1 -
 arch/loongarch/include/asm/page.h  |  1 -
 arch/m68k/include/asm/page_no.h    |  1 -
 arch/microblaze/include/asm/page.h |  1 -
 arch/mips/include/asm/page.h       |  1 +
 arch/nios2/include/asm/page.h      |  1 +
 arch/openrisc/include/asm/page.h   |  1 -
 arch/parisc/include/asm/page.h     |  1 -
 arch/powerpc/include/asm/page.h    |  1 +
 arch/riscv/include/asm/page.h      |  1 -
 arch/s390/include/asm/page.h       |  1 -
 arch/sparc/include/asm/page_64.h   |  1 +
 arch/um/include/asm/page.h         |  1 -
 arch/x86/include/asm/page.h        |  6 --
 arch/x86/include/asm/page_32.h     |  6 ++
 arch/x86/include/asm/page_64.h     | 76 ++++++++++++++++++-----
 arch/x86/lib/clear_page_64.S       | 39 +++---------
 arch/xtensa/include/asm/page.h     |  1 -
 include/linux/highmem.h            | 97 +++++++++++++++++++++++++++++-
 include/linux/mm.h                 | 56 +++++++++++++++++
 mm/memory.c                        | 86 +++++++++++++++++++-------
 27 files changed, 298 insertions(+), 95 deletions(-)

-- 
2.31.1




* [PATCH v10 1/8] treewide: provide a generic clear_user_page() variant
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
@ 2025-12-15 20:49 ` Ankur Arora
  2025-12-18  7:11   ` David Hildenbrand (Red Hat)
  2025-12-15 20:49 ` [PATCH v10 2/8] highmem: introduce clear_user_highpages() Ankur Arora
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora, David Hildenbrand

From: David Hildenbrand <david@redhat.com>

Let's drop all variants that effectively map to clear_page() and
provide it in a generic variant instead.

We'll use the macro clear_user_page to indicate whether an architecture
provides its own variant. Maybe at some point these should be CONFIG_
options.

Also, clear_user_page() is only called from the generic variant of
clear_user_highpage(), so define it only if the architecture does
not provide a clear_user_highpage(). And, for simplicity, define it
in linux/highmem.h.

Note that for parisc, clear_page() and clear_user_page() map to
clear_page_asm(), so we can just get rid of the custom clear_user_page()
implementation. There is a clear_user_page_asm() function on parisc
that seems to be unused. Not sure what's up with that.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Notes:
  - incorporates Christophe Leroy's suggestion of moving
    clear_user_page() into linux/highmem.h.

    The linkage between clear_user_highpage() and clear_user_page()
    isn't completely obvious but I hope the comment above
    clear_user_page() clarifies it.

    Though I guess if HIGHMEM goes away, we might have to revisit it.
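
    To summarize the convention (condensed from the diff below;
    "arch/foo" is a placeholder, not a real architecture):

	/* arch/foo/include/asm/page.h: arch provides its own variant */
	void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
	#define clear_user_page clear_user_page

	/*
	 * include/linux/highmem.h: generic fallback, used only when the
	 * arch defines neither clear_user_highpage() nor clear_user_page().
	 */
	#ifndef clear_user_highpage
	#ifndef clear_user_page
	static inline void clear_user_page(void *addr, unsigned long vaddr,
					   struct page *page)
	{
		clear_page(addr);
	}
	#endif
	...
	#endif /* clear_user_highpage */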

 arch/alpha/include/asm/page.h      |  1 -
 arch/arc/include/asm/page.h        |  2 ++
 arch/arm/include/asm/page-nommu.h  |  1 -
 arch/arm64/include/asm/page.h      |  1 -
 arch/csky/abiv1/inc/abi/page.h     |  1 +
 arch/csky/abiv2/inc/abi/page.h     |  7 -------
 arch/hexagon/include/asm/page.h    |  1 -
 arch/loongarch/include/asm/page.h  |  1 -
 arch/m68k/include/asm/page_no.h    |  1 -
 arch/microblaze/include/asm/page.h |  1 -
 arch/mips/include/asm/page.h       |  1 +
 arch/nios2/include/asm/page.h      |  1 +
 arch/openrisc/include/asm/page.h   |  1 -
 arch/parisc/include/asm/page.h     |  1 -
 arch/powerpc/include/asm/page.h    |  1 +
 arch/riscv/include/asm/page.h      |  1 -
 arch/s390/include/asm/page.h       |  1 -
 arch/sparc/include/asm/page_64.h   |  1 +
 arch/um/include/asm/page.h         |  1 -
 arch/x86/include/asm/page.h        |  6 ------
 arch/xtensa/include/asm/page.h     |  1 -
 include/linux/highmem.h            | 23 +++++++++++++++++++++--
 22 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
index d2c6667d73e9..59d01f9b77f6 100644
--- a/arch/alpha/include/asm/page.h
+++ b/arch/alpha/include/asm/page.h
@@ -11,7 +11,6 @@
 #define STRICT_MM_TYPECHECKS
 
 extern void clear_page(void *page);
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
 
 #define vma_alloc_zeroed_movable_folio(vma, vaddr) \
 	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
diff --git a/arch/arc/include/asm/page.h b/arch/arc/include/asm/page.h
index 9720fe6b2c24..38214e126c6d 100644
--- a/arch/arc/include/asm/page.h
+++ b/arch/arc/include/asm/page.h
@@ -32,6 +32,8 @@ struct page;
 
 void copy_user_highpage(struct page *to, struct page *from,
 			unsigned long u_vaddr, struct vm_area_struct *vma);
+
+#define clear_user_page clear_user_page
 void clear_user_page(void *to, unsigned long u_vaddr, struct page *page);
 
 typedef struct {
diff --git a/arch/arm/include/asm/page-nommu.h b/arch/arm/include/asm/page-nommu.h
index 7c2c72323d17..e74415c959be 100644
--- a/arch/arm/include/asm/page-nommu.h
+++ b/arch/arm/include/asm/page-nommu.h
@@ -11,7 +11,6 @@
 #define clear_page(page)	memset((page), 0, PAGE_SIZE)
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
 /*
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 00f117ff4f7a..b39cc1127e1f 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -36,7 +36,6 @@ struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
 bool tag_clear_highpages(struct page *to, int numpages);
 #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGES
 
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
 typedef struct page *pgtable_t;
diff --git a/arch/csky/abiv1/inc/abi/page.h b/arch/csky/abiv1/inc/abi/page.h
index 2d2159933b76..58307254e7e5 100644
--- a/arch/csky/abiv1/inc/abi/page.h
+++ b/arch/csky/abiv1/inc/abi/page.h
@@ -10,6 +10,7 @@ static inline unsigned long pages_do_alias(unsigned long addr1,
 	return (addr1 ^ addr2) & (SHMLBA-1);
 }
 
+#define clear_user_page clear_user_page
 static inline void clear_user_page(void *addr, unsigned long vaddr,
 				   struct page *page)
 {
diff --git a/arch/csky/abiv2/inc/abi/page.h b/arch/csky/abiv2/inc/abi/page.h
index cf005f13cd15..a5a255013308 100644
--- a/arch/csky/abiv2/inc/abi/page.h
+++ b/arch/csky/abiv2/inc/abi/page.h
@@ -1,11 +1,4 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-
-static inline void clear_user_page(void *addr, unsigned long vaddr,
-				   struct page *page)
-{
-	clear_page(addr);
-}
-
 static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 				  struct page *page)
 {
diff --git a/arch/hexagon/include/asm/page.h b/arch/hexagon/include/asm/page.h
index 137ba7c5de48..f0aed3ed812b 100644
--- a/arch/hexagon/include/asm/page.h
+++ b/arch/hexagon/include/asm/page.h
@@ -113,7 +113,6 @@ static inline void clear_page(void *page)
 /*
  * Under assumption that kernel always "sees" user map...
  */
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
 static inline unsigned long virt_to_pfn(const void *kaddr)
diff --git a/arch/loongarch/include/asm/page.h b/arch/loongarch/include/asm/page.h
index 256d1ff7a1e3..327bf0bc92bf 100644
--- a/arch/loongarch/include/asm/page.h
+++ b/arch/loongarch/include/asm/page.h
@@ -30,7 +30,6 @@
 extern void clear_page(void *page);
 extern void copy_page(void *to, void *from);
 
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
 extern unsigned long shm_align_mask;
diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
index 39db2026a4b4..d2532bc407ef 100644
--- a/arch/m68k/include/asm/page_no.h
+++ b/arch/m68k/include/asm/page_no.h
@@ -10,7 +10,6 @@ extern unsigned long memory_end;
 #define clear_page(page)	memset((page), 0, PAGE_SIZE)
 #define copy_page(to,from)	memcpy((to), (from), PAGE_SIZE)
 
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
 #define vma_alloc_zeroed_movable_folio(vma, vaddr) \
diff --git a/arch/microblaze/include/asm/page.h b/arch/microblaze/include/asm/page.h
index 90ac9f34b4b4..e1e396367ba7 100644
--- a/arch/microblaze/include/asm/page.h
+++ b/arch/microblaze/include/asm/page.h
@@ -45,7 +45,6 @@ typedef unsigned long pte_basic_t;
 # define copy_page(to, from)			memcpy((to), (from), PAGE_SIZE)
 # define clear_page(pgaddr)			memset((pgaddr), 0, PAGE_SIZE)
 
-# define clear_user_page(pgaddr, vaddr, page)	memset((pgaddr), 0, PAGE_SIZE)
 # define copy_user_page(vto, vfrom, vaddr, topg) \
 			memcpy((vto), (vfrom), PAGE_SIZE)
 
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index bc3e3484c1bf..5ec428fcc887 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -90,6 +90,7 @@ static inline void clear_user_page(void *addr, unsigned long vaddr,
 	if (pages_do_alias((unsigned long) addr, vaddr & PAGE_MASK))
 		flush_data_cache_page((unsigned long)addr);
 }
+#define clear_user_page clear_user_page
 
 struct vm_area_struct;
 extern void copy_user_highpage(struct page *to, struct page *from,
diff --git a/arch/nios2/include/asm/page.h b/arch/nios2/include/asm/page.h
index 00a51623d38a..722956ac0bf8 100644
--- a/arch/nios2/include/asm/page.h
+++ b/arch/nios2/include/asm/page.h
@@ -45,6 +45,7 @@
 
 struct page;
 
+#define clear_user_page clear_user_page
 extern void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
 extern void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
 				struct page *to);
diff --git a/arch/openrisc/include/asm/page.h b/arch/openrisc/include/asm/page.h
index 85797f94d1d7..d2cdbf3579bb 100644
--- a/arch/openrisc/include/asm/page.h
+++ b/arch/openrisc/include/asm/page.h
@@ -30,7 +30,6 @@
 #define clear_page(page)	memset((page), 0, PAGE_SIZE)
 #define copy_page(to, from)	memcpy((to), (from), PAGE_SIZE)
 
-#define clear_user_page(page, vaddr, pg)        clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)     copy_page(to, from)
 
 /*
diff --git a/arch/parisc/include/asm/page.h b/arch/parisc/include/asm/page.h
index 8f4e51071ea1..3630b36d07da 100644
--- a/arch/parisc/include/asm/page.h
+++ b/arch/parisc/include/asm/page.h
@@ -21,7 +21,6 @@ struct vm_area_struct;
 
 void clear_page_asm(void *page);
 void copy_page_asm(void *to, void *from);
-#define clear_user_page(vto, vaddr, page) clear_page_asm(vto)
 void copy_user_highpage(struct page *to, struct page *from, unsigned long vaddr,
 		struct vm_area_struct *vma);
 #define __HAVE_ARCH_COPY_USER_HIGHPAGE
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index b28fbb1d57eb..f2bb1f98eebe 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -271,6 +271,7 @@ static inline const void *pfn_to_kaddr(unsigned long pfn)
 
 struct page;
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
+#define clear_user_page clear_user_page
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
 		struct page *p);
 extern int devmem_is_allowed(unsigned long pfn);
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index ffe213ad65a4..061b60b954ec 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -50,7 +50,6 @@ void clear_page(void *page);
 #endif
 #define copy_page(to, from)			memcpy((to), (from), PAGE_SIZE)
 
-#define clear_user_page(pgaddr, vaddr, page)	clear_page(pgaddr)
 #define copy_user_page(vto, vfrom, vaddr, topg) \
 			memcpy((vto), (vfrom), PAGE_SIZE)
 
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index c1d63b613bf9..9c8c5283258e 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -65,7 +65,6 @@ static inline void copy_page(void *to, void *from)
 		: : "memory", "cc");
 }
 
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
 #define vma_alloc_zeroed_movable_folio(vma, vaddr) \
diff --git a/arch/sparc/include/asm/page_64.h b/arch/sparc/include/asm/page_64.h
index d764d8a8586b..fd4dc85fb38b 100644
--- a/arch/sparc/include/asm/page_64.h
+++ b/arch/sparc/include/asm/page_64.h
@@ -43,6 +43,7 @@ void _clear_page(void *page);
 #define clear_page(X)	_clear_page((void *)(X))
 struct page;
 void clear_user_page(void *addr, unsigned long vaddr, struct page *page);
+#define clear_user_page clear_user_page
 #define copy_page(X,Y)	memcpy((void *)(X), (void *)(Y), PAGE_SIZE)
 void copy_user_page(void *to, void *from, unsigned long vaddr, struct page *topage);
 #define __HAVE_ARCH_COPY_USER_HIGHPAGE
diff --git a/arch/um/include/asm/page.h b/arch/um/include/asm/page.h
index 2d363460d896..e348ff489b89 100644
--- a/arch/um/include/asm/page.h
+++ b/arch/um/include/asm/page.h
@@ -26,7 +26,6 @@ struct page;
 #define clear_page(page)	memset((void *)(page), 0, PAGE_SIZE)
 #define copy_page(to,from)	memcpy((void *)(to), (void *)(from), PAGE_SIZE)
 
-#define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
 typedef struct { unsigned long pte; } pte_t;
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 9265f2fca99a..416dc88e35c1 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -22,12 +22,6 @@ struct page;
 extern struct range pfn_mapped[];
 extern int nr_pfn_mapped;
 
-static inline void clear_user_page(void *page, unsigned long vaddr,
-				   struct page *pg)
-{
-	clear_page(page);
-}
-
 static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 				  struct page *topage)
 {
diff --git a/arch/xtensa/include/asm/page.h b/arch/xtensa/include/asm/page.h
index 20655174b111..059493256765 100644
--- a/arch/xtensa/include/asm/page.h
+++ b/arch/xtensa/include/asm/page.h
@@ -126,7 +126,6 @@ void clear_user_highpage(struct page *page, unsigned long vaddr);
 void copy_user_highpage(struct page *to, struct page *from,
 			unsigned long vaddr, struct vm_area_struct *vma);
 #else
-# define clear_user_page(page, vaddr, pg)	clear_page(page)
 # define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 #endif
 
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index abc20f9810fd..9a38512b8000 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -197,15 +197,34 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, int size)
 }
 #endif
 
-/* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */
 #ifndef clear_user_highpage
+#ifndef clear_user_page
+/**
+ * clear_user_page() - clear a page to be mapped to user space
+ * @addr: the address of the page
+ * @vaddr: the address of the user mapping
+ * @page: the page
+ *
+ * We condition the definition of clear_user_page() on the architecture not
+ * having a custom clear_user_highpage(). That's because if there is some
+ * special flushing needed for clear_user_highpage() then it is likely that
+ * clear_user_page() also needs some magic. And, since our only caller
+ * is the generic clear_user_highpage(), not defining it is not much of a loss.
+ */
+static inline void clear_user_page(void *addr, unsigned long vaddr, struct page *page)
+{
+	clear_page(addr);
+}
+#endif
+
+/* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
 	void *addr = kmap_local_page(page);
 	clear_user_page(addr, vaddr, page);
 	kunmap_local(addr);
 }
-#endif
+#endif /* clear_user_highpage */
 
 #ifndef vma_alloc_zeroed_movable_folio
 /**
-- 
2.31.1




* [PATCH v10 2/8] highmem: introduce clear_user_highpages()
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
  2025-12-15 20:49 ` [PATCH v10 1/8] treewide: provide a generic clear_user_page() variant Ankur Arora
@ 2025-12-15 20:49 ` Ankur Arora
  2025-12-15 20:49 ` [PATCH v10 3/8] mm: introduce clear_pages() and clear_user_pages() Ankur Arora
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Define clear_user_highpages() which clears pages sequentially using
clear_user_highpage().

Also add a comment describing clear_user_highpage().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/highmem.h | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 9a38512b8000..9187bfaa709d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -217,7 +217,14 @@ static inline void clear_user_page(void *addr, unsigned long vaddr, struct page
 }
 #endif
 
-/* when CONFIG_HIGHMEM is not set these will be plain clear/copy_page */
+/**
+ * clear_user_highpage() - clear a page to be mapped to user space
+ * @page: start page
+ * @vaddr: start address of the user mapping
+ *
+ * With !CONFIG_HIGHMEM this (and the copy_user_highpage() below) will
+ * be plain clear_user_page() (and copy_user_page()).
+ */
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
 	void *addr = kmap_local_page(page);
@@ -226,6 +233,25 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 }
 #endif /* clear_user_highpage */
 
+/**
+ * clear_user_highpages() - clear a page range to be mapped to user space
+ * @page: start page
+ * @vaddr: start address of the user mapping
+ * @npages: number of pages
+ *
+ * Assumes that all the pages in the region (@page, +@npages) are valid
+ * so this does no exception handling.
+ */
+static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
+					unsigned int npages)
+{
+	do {
+		clear_user_highpage(page, vaddr);
+		vaddr += PAGE_SIZE;
+		page++;
+	} while (--npages);
+}
+
 #ifndef vma_alloc_zeroed_movable_folio
 /**
  * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA.
-- 
2.31.1




* [PATCH v10 3/8] mm: introduce clear_pages() and clear_user_pages()
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
  2025-12-15 20:49 ` [PATCH v10 1/8] treewide: provide a generic clear_user_page() variant Ankur Arora
  2025-12-15 20:49 ` [PATCH v10 2/8] highmem: introduce clear_user_highpages() Ankur Arora
@ 2025-12-15 20:49 ` Ankur Arora
  2025-12-15 20:49 ` [PATCH v10 4/8] highmem: do range clearing in clear_user_highpages() Ankur Arora
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Introduce clear_pages(), to be overridden by architectures that
support more efficient clearing of consecutive pages.

Also introduce clear_user_pages(); however, we do not expect this
function to be overridden anytime soon.

As we do for clear_user_page(), define clear_user_pages() only if the
architecture does not define clear_user_highpage().

That is because if the architecture does define clear_user_highpage(),
then it likely needs some flushing magic when clearing user pages or
highpages. This means we can get away without defining clear_user_pages(),
since, much like its single page sibling, its only potential user is the
generic clear_user_highpages() which should instead be using
clear_user_highpage().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---

Note:
  - reorganize based on Christophe Leroy's suggestion.
  - Dropped David's ack.
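
  As a usage sketch (hypothetical call sites, not part of this patch;
  'page' is a physically contiguous range of 'nr' struct pages with a
  kernel mapping):

	/* kernel-internal range, no user mapping involved */
	clear_pages(page_address(page), nr);

	/* range about to be mapped at vaddr in user space */
	clear_user_pages(page_address(page), vaddr, page, nr);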

 include/linux/highmem.h | 33 +++++++++++++++++++++++++++++++++
 include/linux/mm.h      | 20 ++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 9187bfaa709d..92aa1053c9c1 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -217,6 +217,39 @@ static inline void clear_user_page(void *addr, unsigned long vaddr, struct page
 }
 #endif
 
+/**
+ * clear_user_pages() - clear a page range to be mapped to user space
+ * @addr: start address
+ * @vaddr: start address of the user mapping
+ * @page: start page
+ * @npages: number of pages
+ *
+ * Assumes that the region (@addr, +@npages) has been validated
+ * already so this does no exception handling.
+ *
+ * If the architecture provides a clear_user_page(), use that;
+ * otherwise, we can safely use clear_pages().
+ */
+static inline void clear_user_pages(void *addr, unsigned long vaddr,
+		struct page *page, unsigned int npages)
+{
+
+#ifdef clear_user_page
+	do {
+		clear_user_page(addr, vaddr, page);
+		addr += PAGE_SIZE;
+		vaddr += PAGE_SIZE;
+		page++;
+	} while (--npages);
+#else
+	/*
+	 * Prefer clear_pages() to allow for architectural optimizations
+	 * when operating on contiguous page ranges.
+	 */
+	clear_pages(addr, npages);
+#endif
+}
+
 /**
  * clear_user_highpage() - clear a page to be mapped to user space
  * @page: start page
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 15076261d0c2..12106ebf1a50 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4194,6 +4194,26 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 				unsigned int order) {}
 #endif	/* CONFIG_DEBUG_PAGEALLOC */
 
+#ifndef clear_pages
+/**
+ * clear_pages() - clear a page range for kernel-internal use.
+ * @addr: start address
+ * @npages: number of pages
+ *
+ * Use clear_user_pages() instead when clearing a page range to be
+ * mapped to user space.
+ *
+ * Does absolutely no exception handling.
+ */
+static inline void clear_pages(void *addr, unsigned int npages)
+{
+	do {
+		clear_page(addr);
+		addr += PAGE_SIZE;
+	} while (--npages);
+}
+#endif
+
 #ifdef __HAVE_ARCH_GATE_AREA
 extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
 extern int in_gate_area_no_mm(unsigned long addr);
-- 
2.31.1



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v10 4/8] highmem: do range clearing in clear_user_highpages()
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
                   ` (2 preceding siblings ...)
  2025-12-15 20:49 ` [PATCH v10 3/8] mm: introduce clear_pages() and clear_user_pages() Ankur Arora
@ 2025-12-15 20:49 ` Ankur Arora
  2025-12-18  7:15   ` David Hildenbrand (Red Hat)
  2025-12-15 20:49 ` [PATCH v10 5/8] x86/mm: Simplify clear_page_* Ankur Arora
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Use the range clearing primitive clear_user_pages() when clearing
contiguous pages in clear_user_highpages().

We can safely do that when we have !CONFIG_HIGHMEM and when the
architecture does not have clear_user_highpage().

The first is necessary because, with no intermediate kmaps needed,
contiguous page ranges stay contiguous in the kernel address space.
The second, because if the architecture has clear_user_highpage(),
it likely needs flushing magic when clearing the page, magic that
we aren't privy to.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
Note:
  - reorganized based on the previous two patches. 
  - Removed David's acked-by.

 include/linux/highmem.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 92aa1053c9c1..c6219700569f 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -278,11 +278,28 @@ static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
 					unsigned int npages)
 {
+
+#if defined(clear_user_highpage) || defined(CONFIG_HIGHMEM)
+	/*
+	 * An architecture defined clear_user_highpage() implies special
+	 * handling is needed.
+	 *
+	 * So we use that or, the generic variant if CONFIG_HIGHMEM is
+	 * enabled.
+	 */
 	do {
 		clear_user_highpage(page, vaddr);
 		vaddr += PAGE_SIZE;
 		page++;
 	} while (--npages);
+#else
+
+	/*
+	 * Prefer clear_user_pages() to allow for architectural optimizations
+	 * when operating on contiguous page ranges.
+	 */
+	clear_user_pages(page_address(page), vaddr, page, npages);
+#endif
 }
 
 #ifndef vma_alloc_zeroed_movable_folio
-- 
2.31.1




* [PATCH v10 5/8] x86/mm: Simplify clear_page_*
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
                   ` (3 preceding siblings ...)
  2025-12-15 20:49 ` [PATCH v10 4/8] highmem: do range clearing in clear_user_highpages() Ankur Arora
@ 2025-12-15 20:49 ` Ankur Arora
  2025-12-15 20:49 ` [PATCH v10 6/8] x86/clear_page: Introduce clear_pages() Ankur Arora
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
variations. Inlining gets rid of an unnecessary CALL/RET (which isn't
free when using RETHUNK speculative execution mitigations).
Fix up and rename clear_page_orig() to adapt to the changed calling
convention.

Also add a comment from Dave Hansen detailing various clearing mechanisms
used in clear_page().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
---

Notes:
  - addresses comments from Borislav Petkov and Mateusz Guzik, fleshing
    out some of the details in the comments.

 arch/x86/include/asm/page_32.h |  6 +++
 arch/x86/include/asm/page_64.h | 67 ++++++++++++++++++++++++++--------
 arch/x86/lib/clear_page_64.S   | 39 ++++----------------
 3 files changed, 66 insertions(+), 46 deletions(-)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 0c623706cb7e..19fddb002cc9 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -17,6 +17,12 @@ extern unsigned long __phys_addr(unsigned long);
 
 #include <linux/string.h>
 
+/**
+ * clear_page() - clear a page using a kernel virtual address.
+ * @page: address of kernel page
+ *
+ * Does absolutely no exception handling.
+ */
 static inline void clear_page(void *page)
 {
 	memset(page, 0, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 2f0e47be79a4..ec3307234a17 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -48,26 +48,63 @@ static inline unsigned long __phys_addr_symbol(unsigned long x)
 
 #define __phys_reloc_hide(x)	(x)
 
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
-KCFI_REFERENCE(clear_page_orig);
-KCFI_REFERENCE(clear_page_rep);
-KCFI_REFERENCE(clear_page_erms);
+void __clear_pages_unrolled(void *page);
+KCFI_REFERENCE(__clear_pages_unrolled);
 
-static inline void clear_page(void *page)
+/**
+ * clear_page() - clear a page using a kernel virtual address.
+ * @addr: address of kernel page
+ *
+ * Switch between three implementations of page clearing based on CPU
+ * capabilities:
+ *
+ *  - __clear_pages_unrolled(): the oldest, slowest and universally
+ *    supported method. Zeroes via 8-byte MOV instructions unrolled 8x
+ *    to write a 64-byte cacheline in each loop iteration.
+ *
+ *  - "REP; STOSQ": really old CPUs had crummy REP implementations.
+ *    Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
+ *    trusted. The instruction writes 8-byte per REP iteration but
+ *    CPUs can internally batch these together and do larger writes.
+ *
+ *  - "REP; STOSB": used on CPUs with "enhanced REP MOVSB/STOSB",
+ *    which enumerate 'ERMS' and provide an implementation which
+ *    unlike "REP; STOSQ" above wasn't overly picky about alignment.
+ *    The instruction writes 1-byte per REP iteration with CPUs
+ *    internally batching these together into larger writes and is
+ *    generally fastest of the three.
+ *
+ * Note that when running as a guest, features exposed by the CPU
+ * might be mediated by the hypervisor. So, the STOSQ variant might
+ * be in active use on some systems even when the hardware enumerates
+ * ERMS.
+ *
+ * Does absolutely no exception handling.
+ */
+static inline void clear_page(void *addr)
 {
+	u64 len = PAGE_SIZE;
 	/*
 	 * Clean up KMSAN metadata for the page being cleared. The assembly call
-	 * below clobbers @page, so we perform unpoisoning before it.
+	 * below clobbers @addr, so perform unpoisoning before it.
 	 */
-	kmsan_unpoison_memory(page, PAGE_SIZE);
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
-			   "=D" (page),
-			   "D" (page),
-			   "cc", "memory", "rax", "rcx");
+	kmsan_unpoison_memory(addr, len);
+
+	/*
+	 * The inline asm embeds a CALL instruction and usually that is a no-no
+	 * due to the compiler not knowing that and thus being unable to track
+	 * callee-clobbered registers.
+	 *
+	 * In this case that is fine because the registers clobbered by
+	 * __clear_pages_unrolled() are part of the inline asm register
+	 * specification.
+	 */
+	asm volatile(ALTERNATIVE_2("call __clear_pages_unrolled",
+				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
+				   "rep stosb", X86_FEATURE_ERMS)
+			: "+c" (len), "+D" (addr), ASM_CALL_CONSTRAINT
+			: "a" (0)
+			: "cc", "memory");
 }
 
 void copy_page(void *to, void *from);
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index a508e4a8c66a..f7f356e7218b 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -6,30 +6,15 @@
 #include <asm/asm.h>
 
 /*
- * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
- * recommended to use this when possible and we do use them by default.
- * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
- * Otherwise, use original.
+ * Zero page aligned region.
+ * %rdi	- dest
+ * %rcx	- length
  */
-
-/*
- * Zero a page.
- * %rdi	- page
- */
-SYM_TYPED_FUNC_START(clear_page_rep)
-	movl $4096/8,%ecx
-	xorl %eax,%eax
-	rep stosq
-	RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
-
-SYM_TYPED_FUNC_START(clear_page_orig)
-	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+SYM_TYPED_FUNC_START(__clear_pages_unrolled)
+	shrq   $6, %rcx
 	.p2align 4
 .Lloop:
-	decl	%ecx
+	decq	%rcx
 #define PUT(x) movq %rax,x*8(%rdi)
 	movq %rax,(%rdi)
 	PUT(1)
@@ -43,16 +28,8 @@ SYM_TYPED_FUNC_START(clear_page_orig)
 	jnz	.Lloop
 	nop
 	RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
-
-SYM_TYPED_FUNC_START(clear_page_erms)
-	movl $4096,%ecx
-	xorl %eax,%eax
-	rep stosb
-	RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(__clear_pages_unrolled)
+EXPORT_SYMBOL_GPL(__clear_pages_unrolled)
 
 /*
  * Default clear user-space.
-- 
2.31.1




* [PATCH v10 6/8] x86/clear_page: Introduce clear_pages()
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
                   ` (4 preceding siblings ...)
  2025-12-15 20:49 ` [PATCH v10 5/8] x86/mm: Simplify clear_page_* Ankur Arora
@ 2025-12-15 20:49 ` Ankur Arora
  2025-12-18  7:22   ` David Hildenbrand (Red Hat)
  2025-12-15 20:49 ` [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges Ankur Arora
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Performance when clearing with string instructions (x86-64-stosq and
similar) can vary significantly based on the chunk-size used.

  $ perf bench mem memset -k 4KB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      13.748208 GB/sec

  $ perf bench mem memset -k 2MB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in
  # arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      15.067900 GB/sec

  $ perf bench mem memset -k 1GB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      38.104311 GB/sec

(Both on AMD Milan.)

With a change in chunk-size from 4KB to 1GB, we see the performance go
from 13.7 GB/sec to 38.1 GB/sec. For the chunk-size of 2MB the change isn't
quite as drastic but it is worth adding a clear_page() variant that can
handle contiguous page-extents.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 arch/x86/include/asm/page_64.h | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index ec3307234a17..1895c207f629 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -52,8 +52,9 @@ void __clear_pages_unrolled(void *page);
 KCFI_REFERENCE(__clear_pages_unrolled);
 
 /**
- * clear_page() - clear a page using a kernel virtual address.
- * @addr: address of kernel page
+ * clear_pages() - clear a page range using a kernel virtual address.
+ * @addr: start address of kernel page range
+ * @npages: number of pages
  *
  * Switch between three implementations of page clearing based on CPU
  * capabilities:
@@ -81,11 +82,11 @@ KCFI_REFERENCE(__clear_pages_unrolled);
  *
  * Does absolutely no exception handling.
  */
-static inline void clear_page(void *addr)
+static inline void clear_pages(void *addr, unsigned int npages)
 {
-	u64 len = PAGE_SIZE;
+	u64 len = npages * PAGE_SIZE;
 	/*
-	 * Clean up KMSAN metadata for the page being cleared. The assembly call
+	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
 	 * below clobbers @addr, so perform unpoisoning before it.
 	 */
 	kmsan_unpoison_memory(addr, len);
@@ -106,6 +107,12 @@ static inline void clear_page(void *addr)
 			: "a" (0)
 			: "cc", "memory");
 }
+#define clear_pages clear_pages
+
+static inline void clear_page(void *addr)
+{
+	clear_pages(addr, 1);
+}
 
 void copy_page(void *to, void *from);
 KCFI_REFERENCE(copy_page);
-- 
2.31.1




* [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
                   ` (5 preceding siblings ...)
  2025-12-15 20:49 ` [PATCH v10 6/8] x86/clear_page: Introduce clear_pages() Ankur Arora
@ 2025-12-15 20:49 ` Ankur Arora
  2025-12-16  2:44   ` Andrew Morton
  2025-12-18  7:36   ` David Hildenbrand (Red Hat)
  2025-12-15 20:49 ` [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages Ankur Arora
  2025-12-16  2:48 ` [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Andrew Morton
  8 siblings, 2 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

Clear contiguous page ranges in folio_zero_user() instead of clearing
a single page at a time. Exposing larger ranges enables extent based
processor optimizations.

However, because the underlying clearing primitives do not (or, where
they reduce to a few long-running instructions, cannot) call
cond_resched() to check if preemption is required, limit the worst
case preemption latency by doing the clearing in units of no more
than PROCESS_PAGES_NON_PREEMPT_BATCH pages.

For architectures that define clear_pages(), we assume that the
clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
worth of pages. This should be large enough to allow the processor
to optimize the operation and yet small enough that we see reasonable
preemption latency for when this optimization is not possible
(ex. slow microarchitectures, memory bandwidth saturation.)

Architectures that don't define clear_pages() will continue to use
the base value (single page). And, preemptible models don't need
invocations of cond_resched() so don't care about the batch size.
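
As a worked illustration of the batching (no behaviour beyond what is
described above): a 1GB folio spans 262144 4KB pages. Under
preempt=none|voluntary the loop below clears it in 8MB (2048-page)
units with a cond_resched() before each batch, i.e. 128 batches; under
preempt=full|lazy the whole 262144 pages are passed to the clearing
primitive in a single call.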

The resultant performance depends on the kinds of optimizations
available to the CPU for the region size being cleared. Two classes
of optimizations:

  - clearing iteration costs are amortized over a range larger
    than a single page.
  - cacheline allocation elision (seen on AMD Zen models).

Testing a demand fault workload shows an improved baseline from the
first optimization and a larger improvement when the region being
cleared is large enough for the second optimization.

AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

  $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

                    page-at-a-time     contiguous clearing      change

                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

   pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*

   pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05%       +  5.2%	preempt=none|voluntary
   pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21% [#]   +144.3%	preempt=full|lazy

 [#] Notice that we perform much better with preempt=full|lazy. As
  mentioned above, preemptible models not needing explicit invocations
  of cond_resched() allow clearing of the full extent (1GB) as a
  single unit.
  In comparison the maximum extent used for preempt=none|voluntary is
  PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).

  The larger extent allows the processor to elide cacheline
  allocation (on Milan the threshold is LLC-size=32MB.)

Also as mentioned earlier, the baseline improvement is not specific to
AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
improvement as the Milan pg-sz=2MB workload above (~30%).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h | 38 +++++++++++++++++++++++++++++++++++++-
 mm/memory.c        | 46 +++++++++++++++++++++++++---------------------
 2 files changed, 62 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 12106ebf1a50..45e5e0ef620c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4194,7 +4194,6 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
 				unsigned int order) {}
 #endif	/* CONFIG_DEBUG_PAGEALLOC */
 
-#ifndef clear_pages
 /**
  * clear_pages() - clear a page range for kernel-internal use.
  * @addr: start address
@@ -4204,7 +4203,18 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
  * mapped to user space.
  *
  * Does absolutely no exception handling.
+ *
+ * Note that even though the clearing operation is preemptible, clear_pages()
+ * does not (and on architectures where it reduces to a few long-running
+ * instructions, might not be able to) call cond_resched() to check if
+ * rescheduling is required.
+ *
+ * When running under preemptible models this is fine, since clear_pages(),
+ * even when reduced to long-running instructions, is preemptible.
+ * Under cooperatively scheduled models, however, the caller is expected to
+ * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
  */
+#ifndef clear_pages
 static inline void clear_pages(void *addr, unsigned int npages)
 {
 	do {
@@ -4214,6 +4224,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
 }
 #endif
 
+#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
+#ifdef clear_pages
+/*
+ * The architecture defines clear_pages(), and we assume that it is
+ * generally "fast". So choose a batch size large enough to allow the processor
+ * headroom for optimizing the operation and yet small enough that we see
+ * reasonable preemption latency for when this optimization is not possible
+ * (ex. slow microarchitectures, memory bandwidth saturation.)
+ *
+ * With a value of 8MB and assuming a memory bandwidth of ~10GBps, this should
+ * result in worst case preemption latency of around 1ms when clearing pages.
+ *
+ * (See comment above clear_pages() for why preemption latency is a concern
+ * here.)
+ */
+#define PROCESS_PAGES_NON_PREEMPT_BATCH		(8 << (20 - PAGE_SHIFT))
+#else /* !clear_pages */
+/*
+ * The architecture does not provide a clear_pages() implementation. Assume
+ * that clear_page() -- which clear_pages() will fallback to -- is relatively
+ * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
+ */
+#define PROCESS_PAGES_NON_PREEMPT_BATCH		1
+#endif
+#endif
+
 #ifdef __HAVE_ARCH_GATE_AREA
 extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
 extern int in_gate_area_no_mm(unsigned long addr);
diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..974c48db6089 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7237,40 +7237,44 @@ static inline int process_huge_page(
 	return 0;
 }
 
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
-				unsigned int nr_pages)
+static void clear_contig_highpages(struct page *page, unsigned long addr,
+				   unsigned int npages)
 {
-	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
-	int i;
+	unsigned int i, count, unit;
 
-	might_sleep();
-	for (i = 0; i < nr_pages; i++) {
+	/*
+	 * When clearing we want to operate on the largest extent possible since
+	 * that allows for extent based architecture specific optimizations.
+	 *
+	 * However, since the clearing interfaces (clear_user_highpages(),
+	 * clear_user_pages(), clear_pages()), do not call cond_resched(), we
+	 * limit the batch size when running under non-preemptible scheduling
+	 * models.
+	 */
+	unit = preempt_model_preemptible() ? npages : PROCESS_PAGES_NON_PREEMPT_BATCH;
+
+	for (i = 0; i < npages; i += count) {
 		cond_resched();
-		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
+
+		count = min(unit, npages - i);
+		clear_user_highpages(page + i,
+				     addr + i * PAGE_SIZE, count);
 	}
 }
 
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
-	struct folio *folio = arg;
-
-	clear_user_highpage(folio_page(folio, idx), addr);
-	return 0;
-}
-
 /**
  * folio_zero_user - Zero a folio which will be mapped to userspace.
  * @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support to clear page ranges.
  */
 void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
-	unsigned int nr_pages = folio_nr_pages(folio);
+	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
 
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		clear_gigantic_page(folio, addr_hint, nr_pages);
-	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+	clear_contig_highpages(folio_page(folio, 0),
+				base_addr, folio_nr_pages(folio));
 }
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
-- 
2.31.1




* [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
                   ` (6 preceding siblings ...)
  2025-12-15 20:49 ` [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges Ankur Arora
@ 2025-12-15 20:49 ` Ankur Arora
  2025-12-18  7:49   ` David Hildenbrand (Red Hat)
  2025-12-16  2:48 ` [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Andrew Morton
  8 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-15 20:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	tglx, willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

folio_zero_user() does straight zeroing without caring about
temporal locality for caches.

This replaced the approach from commit c6ddfb6c5890 ("mm, clear_huge_page:
move order algorithm into a separate function"), where we cleared a page
at a time, converging on the faulting page from the left and the right.

To retain limited temporal locality, split the clearing into three
parts: the faulting page and its immediate neighbourhood, and the
remaining regions to its left and right. The local neighbourhood
is cleared last.
Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
there isn't much expectation of cache locality for large folios.
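
As a worked example (illustrative values; the width of 2 comes from the
code below): for a 2MB folio of 512 pages with the fault at page index 5,
the regions are cleared in the order [8, 511] (right of the fault),
[0, 2] (left of the fault) and finally [3, 7] (the faulting page and its
immediate neighbourhood).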

Performance
===

AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
  memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)

anon-w-seq (vm-scalability):
                            stime                  utime

  page-at-a-time      1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
  contiguous clearing 1602.86 ( +- 3.00% )     970.75 ( +- 4.68% )
  neighbourhood-last  1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )

Both stime and utime respond in expected ways. stime drops for both
contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
approaches. However, utime increases for both contiguous clearing
(+19.7%) and neighbourhood-last (+9.28%).

In part this is because anon-w-seq runs with 384 processes zeroing
anonymously mapped memory which they then access sequentially. As
such this is likely an uncommon pattern where the memory bandwidth
is saturated while also being cache limited because we access the
entire region.

Kernel make workload (make -j 12 bzImage):

                            stime                  utime

  page-at-a-time       138.16 ( +- 0.31% )    1015.11 ( +- 0.05% )
  contiguous clearing  133.42 ( +- 0.90% )    1013.49 ( +- 0.05% )
  neighbourhood-last   131.20 ( +- 0.76% )    1011.36 ( +- 0.07% )

For make the utime stays relatively flat with an up to 4.9% improvement
in the stime.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 974c48db6089..d22348b95227 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
  * @addr_hint: The address accessed by the user or the base address.
  *
  * Uses architectural support to clear page ranges.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood is cleared last in order to keep cache
+ * lines of the faulting region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
  */
 void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
 	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+	const int width = 2; /* number of pages cleared last on either side */
+	struct range r[3];
+	int i;
 
-	clear_contig_highpages(folio_page(folio, 0),
-				base_addr, folio_nr_pages(folio));
+	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+		clear_contig_highpages(folio_page(folio, 0),
+				       base_addr, folio_nr_pages(folio));
+		return;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		unsigned int npages = range_len(&r[i]);
+		struct page *page = folio_page(folio, r[i].start);
+		unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
+
+		if (npages > 0)
+			clear_contig_highpages(page, addr, npages);
+	}
 }
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
-- 
2.31.1



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-15 20:49 ` [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges Ankur Arora
@ 2025-12-16  2:44   ` Andrew Morton
  2025-12-16  6:49     ` Ankur Arora
  2025-12-18  7:36   ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2025-12-16  2:44 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, david, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, tglx, willy, raghavendra.kt, chleroy,
	ioworker0, boris.ostrovsky, konrad.wilk

On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Clear contiguous page ranges in folio_zero_user() instead of clearing
> a single page at a time. Exposing larger ranges enables extent based
> processor optimizations.
> 
> However, because the underlying clearing primitives do not, or might
> not be able to, call cond_resched() to check if preemption
> is required, limit the worst case preemption latency by doing the
> clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
> 
> For architectures that define clear_pages(), we assume that the
> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
> worth of pages. This should be large enough to allow the processor
> to optimize the operation and yet small enough that we see reasonable
> preemption latency for when this optimization is not possible
> (ex. slow microarchitectures, memory bandwidth saturation.)
> 
> Architectures that don't define clear_pages() will continue to use
> the base value (single page). And, preemptible models don't need
> invocations of cond_resched() so don't care about the batch size.
> 
> The resultant performance depends on the kinds of optimizations
> available to the CPU for the region size being cleared. Two classes
> of optimizations:
> 
>   - clearing iteration costs are amortized over a range larger
>     than a single page.
>   - cacheline allocation elision (seen on AMD Zen models).

8MB is a big chunk of memory.

> Testing a demand fault workload shows an improved baseline from the
> first optimization and a larger improvement when the region being
> cleared is large enough for the second optimization.
> 
> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

So we break out of the copy to run cond_resched() 8192 times?  This sounds
like a minor cost.

>   $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
> 
>                     page-at-a-time     contiguous clearing      change
> 
>                   (GB/s  +- %stdev)     (GB/s  +- %stdev)
> 
>    pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*
> 
>    pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05%       +  5.2%	preempt=none|voluntary
>    pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21% [#]   +144.3%	preempt=full|lazy

And yet those 8192 cond_resched()'s have a huge impact on the
performance!  I find this result very surprising.  Is it explainable?

>  [#] Notice that we perform much better with preempt=full|lazy. As
>   mentioned above, preemptible models not needing explicit invocations
>   of cond_resched() allow clearing of the full extent (1GB) as a
>   single unit.
>   In comparison the maximum extent used for preempt=none|voluntary is
>   PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
> 
>   The larger extent allows the processor to elide cacheline
>   allocation (on Milan the threshold is LLC-size=32MB.)

It is this?

> Also as mentioned earlier, the baseline improvement is not specific to
> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
> improvement as the Milan pg-sz=2MB workload above (~30%).
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages
  2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
                   ` (7 preceding siblings ...)
  2025-12-15 20:49 ` [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages Ankur Arora
@ 2025-12-16  2:48 ` Andrew Morton
  2025-12-16  5:04   ` Ankur Arora
  8 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2025-12-16  2:48 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, david, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, tglx, willy, raghavendra.kt, chleroy,
	ioworker0, boris.ostrovsky, konrad.wilk

On Mon, 15 Dec 2025 12:49:14 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> This series adds clearing of contiguous page ranges for hugepages.
> 

Review is a bit thin at this time but it's at v10 and seems a thing we
want to push ahead with.

So I'll add this to mm.git's mm-new branch for testing.  I expect a few
days after that I'll quietly move it into the mm-unstable branch where
it will receive linux-next exposure.  Further progress into mm.git's
non-rebasing for-next-merge-window mm-stable branch will depend upon
test and review outcomes.

Thanks.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages
  2025-12-16  2:48 ` [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Andrew Morton
@ 2025-12-16  5:04   ` Ankur Arora
  2025-12-18  7:38     ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-16  5:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, david, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk


Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 15 Dec 2025 12:49:14 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> This series adds clearing of contiguous page ranges for hugepages.
>>
>
> Review is a bit thin at this time but it's at v10 and seems a thing we
> want to push ahead with.

I think David Hildenbrand has looked at all the HIGHMEM patches quite closely
(though, not this iteration where I reorganized things a bit.)

> So I'll add this to mm.git's mm-new branch for testing.  I expect a few
> days after that I'll quietly move it into the mm-unstable branch where
> it will receive linux-next exposure.  Further progress into mm.git's
> non-rebasing for-next-merge-window mm-stable branch will depend upon
> test and review outcomes.

Sounds great. Thanks!

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-16  2:44   ` Andrew Morton
@ 2025-12-16  6:49     ` Ankur Arora
  2025-12-16 15:12       ` Andrew Morton
  0 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-16  6:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, david, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk


Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Clear contiguous page ranges in folio_zero_user() instead of clearing
>> a single page at a time. Exposing larger ranges enables extent based
>> processor optimizations.
>>
>> However, because the underlying clearing primitives do not, or might
>> not be able to, call cond_resched() to check if preemption
>> is required, limit the worst case preemption latency by doing the
>> clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
>>
>> For architectures that define clear_pages(), we assume that the
>> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
>> worth of pages. This should be large enough to allow the processor
>> to optimize the operation and yet small enough that we see reasonable
>> preemption latency for when this optimization is not possible
>> (ex. slow microarchitectures, memory bandwidth saturation.)
>>
>> Architectures that don't define clear_pages() will continue to use
>> the base value (single page). And, preemptible models don't need
>> invocations of cond_resched() so don't care about the batch size.
>>
>> The resultant performance depends on the kinds of optimizations
>> available to the CPU for the region size being cleared. Two classes
>> of optimizations:
>>
>>   - clearing iteration costs are amortized over a range larger
>>     than a single page.
>>   - cacheline allocation elision (seen on AMD Zen models).
>
> 8MB is a big chunk of memory.
>
>> Testing a demand fault workload shows an improved baseline from the
>> first optimization and a larger improvement when the region being
>> cleared is large enough for the second optimization.
>>
>> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>
> So we break out of the copy to run cond_resched() 8192 times?  This sounds
> like a minor cost.
>
>>   $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>
>>                     page-at-a-time     contiguous clearing      change
>>
>>                   (GB/s  +- %stdev)     (GB/s  +- %stdev)
>>
>>    pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*
>>
>>    pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05%       +  5.2%	preempt=none|voluntary
>>    pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21% [#]   +144.3%	preempt=full|lazy
>
> And yet those 8192 cond_resched()'s have a huge impact on the
> performance!  I find this result very surprising.  Is it explainable?

I agree about this being surprising. On the 2MB extent, I still find the
30% quite high but I think a decent portion of it is:

  - on x86, the CPU is executing a single microcoded insn: REP; STOSB. And,
    because it's doing it for a 2MB extent instead of a bunch of 4K extents,
    it saves the microcoding costs (and I suspect it allows it to do some
    range operation which also helps; a rough sketch follows below.)

  - the second reason (from Ingo) was again the per-iteration cost, which
    given all of the mitigations on x86 is quite substantial.

    On the AMD systems I had tested on, I think there's at least the cost
    of RET misprediction in there.

    (https://lore.kernel.org/lkml/Z_yzshvBmYiPrxU0@gmail.com/)
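
To make the first point concrete, here is a minimal userspace sketch
(x86-64 only; not the series' clear_pages() implementation, and the
helper names are made up) contrasting one REP STOSQ per 4K page with a
single REP STOSQ over the whole extent:

#include <stdlib.h>

#define PAGE_SIZE 4096UL

/* zero 'bytes' (a multiple of 8) starting at 'addr' with one REP STOSQ */
static void rep_stosq_zero(void *addr, unsigned long bytes)
{
	unsigned long qwords = bytes / 8;

	asm volatile("rep stosq"
		     : "+D" (addr), "+c" (qwords)
		     : "a" (0UL)
		     : "memory");
}

/* page-at-a-time: pay the per-invocation setup cost once per 4K page */
static void zero_page_at_a_time(void *addr, unsigned long npages)
{
	for (unsigned long i = 0; i < npages; i++)
		rep_stosq_zero((char *)addr + i * PAGE_SIZE, PAGE_SIZE);
}

/* contiguous: expose the full extent to the processor in one invocation */
static void zero_contiguous(void *addr, unsigned long npages)
{
	rep_stosq_zero(addr, npages * PAGE_SIZE);
}

int main(void)
{
	unsigned long npages = (2UL << 20) / PAGE_SIZE;	/* a 2MB extent */
	void *buf = aligned_alloc(PAGE_SIZE, npages * PAGE_SIZE);

	if (!buf)
		return 1;
	zero_page_at_a_time(buf, npages);
	zero_contiguous(buf, npages);
	free(buf);
	return 0;
}

Timing the two over a large region is a quick way to see the
per-iteration overhead being amortized away.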

>>  [#] Notice that we perform much better with preempt=full|lazy. As
>>   mentioned above, preemptible models not needing explicit invocations
>>   of cond_resched() allow clearing of the full extent (1GB) as a
>>   single unit.
>>   In comparison the maximum extent used for preempt=none|voluntary is
>>   PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>>
>>   The larger extent allows the processor to elide cacheline
>>   allocation (on Milan the threshold is LLC-size=32MB.)
>
> It is this?

Yeah I think so. For size >= 32MB, the microcoder can really just elide
cacheline allocation, and with the foreknowledge of the extent can perhaps
optimize on cache coherence traffic (this last one is my speculation).

On cacheline allocation elision, compare the L1-dcache-load in the two versions
below:

pg-sz=1GB:
  -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
  -    544,878,976      instructions                     #    0.06  insn per cycle
  -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
  -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)

  +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
  +     10,979,121      instructions                     #    0.00  insn per cycle
  +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
  +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)

(From an earlier version of this series: https://lore.kernel.org/lkml/20250414034607.762653-5-ankur.a.arora@oracle.com/)

Maybe I should have kept it in this commit :).

>> Also as mentioned earlier, the baseline improvement is not specific to
>> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
>> improvement as the Milan pg-sz=2MB workload above (~30%).
>>

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-16  6:49     ` Ankur Arora
@ 2025-12-16 15:12       ` Andrew Morton
  2025-12-17  8:48         ` Ankur Arora
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2025-12-16 15:12 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, david, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, tglx, willy, raghavendra.kt, chleroy,
	ioworker0, boris.ostrovsky, konrad.wilk

On Mon, 15 Dec 2025 22:49:25 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> >>  [#] Notice that we perform much better with preempt=full|lazy. As
> >>   mentioned above, preemptible models not needing explicit invocations
> >>   of cond_resched() allow clearing of the full extent (1GB) as a
> >>   single unit.
> >>   In comparison the maximum extent used for preempt=none|voluntary is
> >>   PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
> >>
> >>   The larger extent allows the processor to elide cacheline
> >>   allocation (on Milan the threshold is LLC-size=32MB.)
> >
> > It is this?
> 
> Yeah I think so. For size >= 32MB, the microcoder can really just elide
> cacheline allocation, and with the foreknowledge of the extent can perhaps
> optimize on cache coherence traffic (this last one is my speculation).
> 
> On cacheline allocation elision, compare the L1-dcache-load in the two versions
> below:
> 
> pg-sz=1GB:
>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
>   -    544,878,976      instructions                     #    0.06  insn per cycle
>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
> 
>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
>   +     10,979,121      instructions                     #    0.00  insn per cycle
>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
> 

That says L1 d-cache loads went from 600 million/sec down to 20
million/sec when using 32MB chunks?

Do you know what happens to preemption latency if you increase that
chunk size from 8MB to 32MB?  At 42GB/sec, 32MB will take less than a
millisecond, yes?  I'm not aware of us really having any latency
targets in these preemption modes, but 1 millisecond sounds pretty
good.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-16 15:12       ` Andrew Morton
@ 2025-12-17  8:48         ` Ankur Arora
  2025-12-17 18:54           ` Andrew Morton
  0 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-17  8:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, david, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk


Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 15 Dec 2025 22:49:25 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> >>  [#] Notice that we perform much better with preempt=full|lazy. As
>> >>   mentioned above, preemptible models not needing explicit invocations
>> >>   of cond_resched() allow clearing of the full extent (1GB) as a
>> >>   single unit.
>> >>   In comparison the maximum extent used for preempt=none|voluntary is
>> >>   PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>> >>
>> >>   The larger extent allows the processor to elide cacheline
>> >>   allocation (on Milan the threshold is LLC-size=32MB.)
>> >
>> > It is this?
>>
>> Yeah I think so. For size >= 32MB, the microcoder can really just elide
>> cacheline allocation, and with the foreknowledge of the extent can perhaps
>> optimize on cache coherence traffic (this last one is my speculation).
>>
>> On cacheline allocation elision, compare the L1-dcache-load in the two versions
>> below:
>>
>> pg-sz=1GB:
>>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
>>   -    544,878,976      instructions                     #    0.06  insn per cycle
>>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
>>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
>>
>>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
>>   +     10,979,121      instructions                     #    0.00  insn per cycle
>>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
>>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
>>
>
> That says L1 d-cache loads went from 600 million/sec down to 20
> million/sec when using 32MB chunks?

Sorry, should have mentioned that that run was with preempt=full/lazy.
For those the chunk size is the whole page (GB page in that case).

The context for 32MB was that that's the LLC-size for these systems.
And, from observed behaviour the cacheline allocation elision
optimization only happens when the chunk size used is larger than that.

> Do you know what happens to preemption latency if you increase that
> chunk size from 8MB to 32MB?

So, I gathered some numbers on a Zen4/Genoa system. The ones above are
from Zen3/Milan.

region-sz=64GB, loop-count=3 (total region-size=3*64GB):

                                Bandwidth    L1-dcache-loads

    pg-sz=2MB, batch-sz= 8MB   25.10 GB/s    6,745,859,855  # 2.00 L1-dcache-loads/64B
       # pg-sz=2MB for context

    pg-sz=1GB, batch-sz= 8MB   26.88 GB/s    6,469,900,728  # 2.00 L1-dcache-loads/64B
    pg-sz=1GB, batch-sz=32MB   38.69 GB/s    2,559,249,546  # 0.79 L1-dcache-loads/64B
    pg-sz=1GB, batch-sz=64MB   46.91 GB/s      919,539,544  # 0.28 L1-dcache-loads/64B

    pg-sz=1GB, batch-sz= 1GB   58.68 GB/s       79,458,439  # 0.024 L1-dcache-loads/64B

All of these are for preempt=none, and with boost=0. (With boost=1 the
BW increases by ~25%.)

So, I wasn't quite right about the LLC-size=32MB being the threshold for
this optimization. There is a change in behaviour at that point but it
does improve beyond that.
(Ideally this threshold would be a processor MSR. That way we could
use this for 2MB pages as well. Oh well.)

> At 42GB/sec, 32MB will take less than a
> millisecond, yes?  I'm not aware of us really having any latency
> targets in these preemption modes, but 1 millisecond sounds pretty
> good.

Agreed. The only complaint threshold I see is 100ms (default value of
sysctl_resched_latency_warn_ms) which is pretty far from ~1ms.

And having a threshold of 32MB might benefit other applications since
we won't be discarding their cachelines in favour of filling up the
cache with zeroes.

I think the only problem cases might be slow uarchs and workloads where
the memory bus is saturated which might dilate the preemption latency.

And, even if the operation takes say ~20ms, that should still leave us
with a reasonably large margin.
(And, any latency sensitive users are probably not running with
preempt=none/voluntary.)

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-17  8:48         ` Ankur Arora
@ 2025-12-17 18:54           ` Andrew Morton
  2025-12-17 19:51             ` Ankur Arora
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2025-12-17 18:54 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, david, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, tglx, willy, raghavendra.kt, chleroy,
	ioworker0, boris.ostrovsky, konrad.wilk

On Wed, 17 Dec 2025 00:48:50 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> > At 42GB/sec, 32MB will take less than a
> > millisecond, yes?  I'm not aware of us really having any latency
> > targets in these preemption modes, but 1 millisecond sounds pretty
> > good.
> 
> Agreed. The only complaint threshold I see is 100ms (default value of
> sysctl_resched_latency_warn_ms) which is pretty far from ~1ms.
> 
> And having a threshold of 32MB might benefit other applications since
> we won't be discarding their cachelines in favour of filling up the
> cache with zeroes.
> 
> I think the only problem cases might be slow uarchs and workloads where
> the memory bus is saturated which might dilate the preemption latency.
> 
> And, even if the operation takes say ~20ms, that should still leave us
> with a reasonably large margin.
> (And, any latency senstive users are probably not running with
> preempt=none/voluntary.)

So I think you're saying that yes, we should increase the chunk size?

If so, what's the timing on that?  It would be nice to do it in the
current -rc cycle for testing reasons and so the changelogs can be
updated to reflect the altered performance numbers.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-17 18:54           ` Andrew Morton
@ 2025-12-17 19:51             ` Ankur Arora
  2025-12-17 20:26               ` Andrew Morton
  0 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-17 19:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, david, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk


Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 17 Dec 2025 00:48:50 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> > At 42GB/sec, 32MB will take less than a
>> > millisecond, yes?  I'm not aware of us really having any latency
>> > targets in these preemption modes, but 1 millisecond sounds pretty
>> > good.
>>
>> Agreed. The only complaint threshold I see is 100ms (default value of
>> sysctl_resched_latency_warn_ms) which is pretty far from ~1ms.
>>
>> And having a threshold of 32MB might benefit other applications since
>> we won't be discarding their cachelines in favour of filling up the
>> cache with zeroes.
>>
>> I think the only problem cases might be slow uarchs and workloads where
>> the memory bus is saturated which might dilate the preemption latency.
>>
>> And, even if the operation takes say ~20ms, that should still leave us
>> with a reasonably large margin.
>> (And, any latency senstive users are probably not running with
>> preempt=none/voluntary.)
>
> So I think you're saying that yes, we should increase the chunk size?

Yeah.

> If so, what's the timing on that?  It would be nice to do it in the
> current -rc cycle for testing reasons and so the changelogs can be
> updated to reflect the altered performance numbers.

I can send out an updated version of this patch later today. I think the
only real change is updating the constant and perf stats motivating
the chunk size value of 32MB.

Anything else you also think needs doing for this?

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-17 19:51             ` Ankur Arora
@ 2025-12-17 20:26               ` Andrew Morton
  2025-12-18  0:51                 ` Ankur Arora
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2025-12-17 20:26 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, david, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, tglx, willy, raghavendra.kt, chleroy,
	ioworker0, boris.ostrovsky, konrad.wilk

On Wed, 17 Dec 2025 11:51:43 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> > If so, what's the timing on that?  It would be nice to do it in the
> > current -rc cycle for testing reasons and so the changelogs can be
> > updated to reflect the altered performance numbers.
> 
> I can send out an updated version of this patch later today. I think the
> only real change is updating the constant and perf stats motivating
> the chunk size value of 32MB.

Yep.  A tiny change wouldn't normally require a full resend, but fairly
widespread changelog updates would best be handled with a v11, please.

> Anything else you also think needs doing for this?

Nope.  Just lots of review, as always ;)

What's the story with architectures other than x86, btw?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-17 20:26               ` Andrew Morton
@ 2025-12-18  0:51                 ` Ankur Arora
  0 siblings, 0 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-18  0:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, david, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk,
	kristina.martsenko, catalin.marinas


[ Added Kristina and Catalin for the FEAT_MOPS question. ]

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 17 Dec 2025 11:51:43 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> > If so, what's the timing on that?  It would be nice to do it in the
>> > current -rc cycle for testing reasons and so the changelogs can be
>> > updated to reflect the altered performance numbers.
>>
>> I can send out an updated version of this patch later today. I think the
>> only real change is updating the constant and perf stats motivating
>> the chunk size value of 32MB.
>
> Yep.  A tiny change wouldn't normally require a full resend, but fairly
> widespread changelog updates would best be handled with a v11, please.

True, it will need updates to patches 7 and 8.

Will send out a v11 after rerunning tests for both of those. Might take
a day or two but should be able to send it out this week.

>> Anything else you also think needs doing for this?
>
> Nope.  Just lots of review, as always ;)
>
> What's the story with architectures other than x86, btw?

The only other architecture I know of which has a similar range
primitive is arm64 (with FEAT_MOPS). That should be extendable to larger
page sizes.
Don't have any numbers on it though. It's only available after arm64 v8.7
which I should have access to next year.
(But maybe Kristina or Catalin have tried out clearing large ranges with
MOPS?)

Other than that, the only one I know of is powerpc, which already uses a
primitive to zero a cacheline (DCBZ). That seems quite similar to
CLZERO on AMD Zen systems (though CLZERO does uncached, weakly ordered
writes so needs a store barrier at the end).

Just from googling, powerpc's implementation seems to be pretty optimal
already, so it probably wouldn't gain much from larger chunk sizes and
removal of the cond_resched().

But, CLZERO performs on par with (or better than) this "REP; STOS"
implementation, especially for smaller extents. So maybe in the future
we could use it to improve the 2MB performance for AMD Zen.

IMO the fiddly part might be in deciding when the cost of not-caching
is higher than the speedup from not-caching.
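
For reference, the CLZERO pattern referred to above, as a rough
userspace sketch (CLZERO is AMD-specific; this assumes a 64-byte cache
line and cache-line-aligned input, and is not code from this series):

#include <stddef.h>

#define CACHELINE_SIZE 64UL

/* zero [addr, addr + bytes) one cacheline at a time using CLZERO */
static void clzero_zero(void *addr, size_t bytes)
{
	char *p = addr;
	char *end = p + bytes;

	for (; p < end; p += CACHELINE_SIZE)
		asm volatile("clzero" : : "a" (p) : "memory");

	/* CLZERO stores are weakly ordered: fence before relying on them */
	asm volatile("sfence" : : : "memory");
}

int main(void)
{
	static char buf[4096] __attribute__((aligned(64)));

	clzero_zero(buf, sizeof(buf));
	return 0;
}

The trailing SFENCE is the store barrier mentioned above.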

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 1/8] treewide: provide a generic clear_user_page() variant
  2025-12-15 20:49 ` [PATCH v10 1/8] treewide: provide a generic clear_user_page() variant Ankur Arora
@ 2025-12-18  7:11   ` David Hildenbrand (Red Hat)
  2025-12-18 19:31     ` Ankur Arora
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  7:11 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, tglx,
	willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk

On 12/15/25 21:49, Ankur Arora wrote:
> From: David Hildenbrand <david@redhat.com>
> 
> Let's drop all variants that effectively map to clear_page() and
> provide it in a generic variant instead.
> 
> We'll use the macro clear_user_page to indicate whether an architecture
> provides its own variant. Maybe at some point these should be CONFIG_
> options.

Can we drop the second sentence?

> 
> Also, clear_user_page() is only called from the generic variant of
> clear_user_highpage(), so define it only if the architecture does
> not provide a clear_user_highpage(). And, for simplicity define it
> in linux/highmem.h.
> 
> Note that for parisc, clear_page() and clear_user_page() map to
> clear_page_asm(), so we can just get rid of the custom clear_user_page()
> implementation. There is a clear_user_page_asm() function on parisc,
> that seems to be unused. Not sure what's up with that.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

You should likely now add

Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>

above your SB :)

> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>

Skimmed over it and nothing jumped at me.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 4/8] highmem: do range clearing in clear_user_highpages()
  2025-12-15 20:49 ` [PATCH v10 4/8] highmem: do range clearing in clear_user_highpages() Ankur Arora
@ 2025-12-18  7:15   ` David Hildenbrand (Red Hat)
  2025-12-18 20:01     ` Ankur Arora
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  7:15 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, tglx,
	willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk

On 12/15/25 21:49, Ankur Arora wrote:
> Use the range clearing primitive clear_user_pages() when clearing
> contiguous pages in clear_user_highpages().
> 
> We can safely do that when we have !CONFIG_HIGHMEM and when the
> architecture does not have clear_user_highpage.
> 
> The first is necessary because not doing intermediate maps for
> pages lets contiguous page ranges stay contiguous. The second,
> because if the architecture has clear_user_highpage(), it likely
> needs flushing magic when clearing the page, magic that we aren't
> privy to.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---

Can't you squash #4 into #2 if you move #3 in front of them? Or is there 
a dependency with #2 that I am missing?

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 6/8] x86/clear_page: Introduce clear_pages()
  2025-12-15 20:49 ` [PATCH v10 6/8] x86/clear_page: Introduce clear_pages() Ankur Arora
@ 2025-12-18  7:22   ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  7:22 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, tglx,
	willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk

On 12/15/25 21:49, Ankur Arora wrote:
> Performance when clearing with string instructions (x86-64-stosq and
> similar) can vary significantly based on the chunk-size used.
> 
>    $ perf bench mem memset -k 4KB -s 4GB -f x86-64-stosq
>    # Running 'mem/memset' benchmark:
>    # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
>    # Copying 4GB bytes ...
> 
>        13.748208 GB/sec
> 
>    $ perf bench mem memset -k 2MB -s 4GB -f x86-64-stosq
>    # Running 'mem/memset' benchmark:
>    # function 'x86-64-stosq' (movsq-based memset() in
>    # arch/x86/lib/memset_64.S)
>    # Copying 4GB bytes ...
> 
>        15.067900 GB/sec
> 
>    $ perf bench mem memset -k 1GB -s 4GB -f x86-64-stosq
>    # Running 'mem/memset' benchmark:
>    # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
>    # Copying 4GB bytes ...
> 
>        38.104311 GB/sec
> 
> (Both on AMD Milan.)
> 
> With a change in chunk-size from 4KB to 1GB, we see the performance go
> from 13.7 GB/sec to 38.1 GB/sec. For the chunk-size of 2MB the change isn't
> quite as drastic but it is worth adding a clear_page() variant that can
> handle contiguous page-extents.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

Nothing jumped at me.

Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-15 20:49 ` [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges Ankur Arora
  2025-12-16  2:44   ` Andrew Morton
@ 2025-12-18  7:36   ` David Hildenbrand (Red Hat)
  2025-12-18 20:16     ` Ankur Arora
  1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  7:36 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, tglx,
	willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk

On 12/15/25 21:49, Ankur Arora wrote:
> Clear contiguous page ranges in folio_zero_user() instead of clearing
> a single page at a time. Exposing larger ranges enables extent based
> processor optimizations.
> 
> However, because the underlying clearing primitives do not, or might
> not be able to, call cond_resched() to check if preemption
> is required, limit the worst case preemption latency by doing the
> clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
> 
> For architectures that define clear_pages(), we assume that the
> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
> worth of pages. This should be large enough to allow the processor
> to optimize the operation and yet small enough that we see reasonable
> preemption latency for when this optimization is not possible
> (ex. slow microarchitectures, memory bandwidth saturation.)
> 
> Architectures that don't define clear_pages() will continue to use
> the base value (single page). And, preemptible models don't need
> invocations of cond_resched() so don't care about the batch size.
> 
> The resultant performance depends on the kinds of optimizations
> available to the CPU for the region size being cleared. Two classes
> of optimizations:
> 
>    - clearing iteration costs are amortized over a range larger
>      than a single page.
>    - cacheline allocation elision (seen on AMD Zen models).
> 
> Testing a demand fault workload shows an improved baseline from the
> first optimization and a larger improvement when the region being
> cleared is large enough for the second optimization.
> 
> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
> 
>    $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
> 
>                      page-at-a-time     contiguous clearing      change
> 
>                    (GB/s  +- %stdev)     (GB/s  +- %stdev)
> 
>     pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*
> 
>     pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05%       +  5.2%	preempt=none|voluntary
>     pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21% [#]   +144.3%	preempt=full|lazy
> 
>   [#] Notice that we perform much better with preempt=full|lazy. As
>    mentioned above, preemptible models not needing explicit invocations
>    of cond_resched() allow clearing of the full extent (1GB) as a
>    single unit.
>    In comparison the maximum extent used for preempt=none|voluntary is
>    PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
> 
>    The larger extent allows the processor to elide cacheline
>    allocation (on Milan the threshold is LLC-size=32MB.)
> 
> Also as mentioned earlier, the baseline improvement is not specific to
> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
> improvement as the Milan pg-sz=2MB workload above (~30%).
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>   include/linux/mm.h | 38 +++++++++++++++++++++++++++++++++++++-
>   mm/memory.c        | 46 +++++++++++++++++++++++++---------------------
>   2 files changed, 62 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 12106ebf1a50..45e5e0ef620c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4194,7 +4194,6 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>   				unsigned int order) {}
>   #endif	/* CONFIG_DEBUG_PAGEALLOC */
>   
> -#ifndef clear_pages

Why is that change part of this patch?

Looks like this should either go into the patch introducing 
clear_pages() (#3 ?).

>   /**
>    * clear_pages() - clear a page range for kernel-internal use.
>    * @addr: start address
> @@ -4204,7 +4203,18 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>    * mapped to user space.
>    *
>    * Does absolutely no exception handling.
> + *
> + * Note that even though the clearing operation is preemptible, clear_pages()
> + * does not (and on architectures where it reduces to a few long-running
> + * instructions, might not be able to) call cond_resched() to check if
> + * rescheduling is required.
> + *
> + * When running under preemptible models this is fine, since clear_pages(),
> + * even when reduced to long-running instructions, is preemptible.
> + * Under cooperatively scheduled models, however, the caller is expected to
> + * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
>    */
> +#ifndef clear_pages
>   static inline void clear_pages(void *addr, unsigned int npages)
>   {
>   	do {
> @@ -4214,6 +4224,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
>   }
>   #endif
>   
> +#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
> +#ifdef clear_pages
> +/*
> + * The architecture defines clear_pages(), and we assume that it is
> + * generally "fast". So choose a batch size large enough to allow the processor
> + * headroom for optimizing the operation and yet small enough that we see
> + * reasonable preemption latency for when this optimization is not possible
> + * (ex. slow microarchitectures, memory bandwidth saturation.)
> + *
> + * With a value of 8MB and assuming a memory bandwidth of ~10GBps, this should
> + * result in worst case preemption latency of around 1ms when clearing pages.
> + *
> + * (See comment above clear_pages() for why preemption latency is a concern
> + * here.)
> + */
> +#define PROCESS_PAGES_NON_PREEMPT_BATCH		(8 << (20 - PAGE_SHIFT))
> +#else /* !clear_pages */
> +/*
> + * The architecture does not provide a clear_pages() implementation. Assume
> + * that clear_page() -- which clear_pages() will fallback to -- is relatively
> + * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
> + */
> +#define PROCESS_PAGES_NON_PREEMPT_BATCH		1
> +#endif
> +#endif
> +
>   #ifdef __HAVE_ARCH_GATE_AREA
>   extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
>   extern int in_gate_area_no_mm(unsigned long addr);
> diff --git a/mm/memory.c b/mm/memory.c
> index 2a55edc48a65..974c48db6089 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7237,40 +7237,44 @@ static inline int process_huge_page(
>   	return 0;
>   }
>   
> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
> -				unsigned int nr_pages)
> +static void clear_contig_highpages(struct page *page, unsigned long addr,
> +				   unsigned int npages)
>   {
> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> -	int i;
> +	unsigned int i, count, unit;
>   
> -	might_sleep();
> -	for (i = 0; i < nr_pages; i++) {
> +	/*
> +	 * When clearing we want to operate on the largest extent possible since
> +	 * that allows for extent based architecture specific optimizations.
> +	 *
> +	 * However, since the clearing interfaces (clear_user_highpages(),
> +	 * clear_user_pages(), clear_pages()), do not call cond_resched(), we
> +	 * limit the batch size when running under non-preemptible scheduling
> +	 * models.
> +	 */
> +	unit = preempt_model_preemptible() ? npages : PROCESS_PAGES_NON_PREEMPT_BATCH;
> +
> +	for (i = 0; i < npages; i += count) {
>   		cond_resched();
> -		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
> +
> +		count = min(unit, npages - i);
> +		clear_user_highpages(page + i,
> +				     addr + i * PAGE_SIZE, count);

I guess that logic could be pushed down for the 
clear_user_highpages()->clear_pages() implementation (arch or generic) 
to take care of that, so not every user would have to care about that.

No strong opinion as we could do that later whenever we actually get 
more clear_pages() users :)

> -
>   /**
>    * folio_zero_user - Zero a folio which will be mapped to userspace.
>    * @folio: The folio to zero.
> - * @addr_hint: The address will be accessed or the base address if uncelar.
> + * @addr_hint: The address accessed by the user or the base address.
> + *
> + * Uses architectural support to clear page ranges.

I think that comment can be dropped. Implementation detail :)

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages
  2025-12-16  5:04   ` Ankur Arora
@ 2025-12-18  7:38     ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  7:38 UTC (permalink / raw)
  To: Ankur Arora, Andrew Morton
  Cc: linux-kernel, linux-mm, x86, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, tglx, willy, raghavendra.kt, chleroy,
	ioworker0, boris.ostrovsky, konrad.wilk

On 12/16/25 06:04, Ankur Arora wrote:
> 
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
>> On Mon, 15 Dec 2025 12:49:14 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>> This series adds clearing of contiguous page ranges for hugepages.
>>>
>>
>> Review is a bit thin at this time but it's at v10 and seems a thing we
>> want to push ahead with.
> 
> I think David Hildenbrand has looked at all the HIGHMEM patches quite closely
> (though, not this iteration where I reorganized things a bit.)

Yeah, I've been reviewing this from the very beginning ... and I hope
we won't start to rush things in simply because I am not looking for 5 
days ;)

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages
  2025-12-15 20:49 ` [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages Ankur Arora
@ 2025-12-18  7:49   ` David Hildenbrand (Red Hat)
  2025-12-18 21:01     ` Ankur Arora
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18  7:49 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: akpm, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz, tglx,
	willy, raghavendra.kt, chleroy, ioworker0, boris.ostrovsky,
	konrad.wilk

On 12/15/25 21:49, Ankur Arora wrote:
> folio_zero_user() does straight zeroing without caring about
> temporal locality for caches.
> 
> This replaced the approach of commit c6ddfb6c5890 ("mm, clear_huge_page:
> move order algorithm into a separate function"), where we cleared a page
> at a time, converging on the faulting page from the left and the right.
> 
> To retain limited temporal locality, split the clearing into three
> parts: the faulting page and its immediate neighbourhood, and the
> remaining regions to its left and right. The local neighbourhood
> will be cleared last.
> Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
> there isn't much expectation of cache locality for large folios.
> 
> Performance
> ===
> 
> AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
>    memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)
> 
> anon-w-seq (vm-scalability):
>                              stime                  utime
> 
>    page-at-a-time      1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
>    contiguous clearing 1602.86 ( +- 3.00% )     970.75 ( +- 4.68% )
>    neighbourhood-last  1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )
> 
> Both stime and utime respond in expected ways. stime drops for both
> contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
> approaches. However, utime increases for both contiguous clearing
> (+19.7%) and neighbourhood-last (+9.28%).
> 
> In part this is because anon-w-seq runs with 384 processes zeroing
> anonymously mapped memory which they then access sequentially. As
> such this is likely an uncommon pattern where the memory bandwidth
> is saturated while also being cache limited because we access the
> entire region.
> 
> Kernel make workload (make -j 12 bzImage):
> 
>                              stime                  utime
> 
>    page-at-a-time       138.16 ( +- 0.31% )    1015.11 ( +- 0.05% )
>    contiguous clearing  133.42 ( +- 0.90% )    1013.49 ( +- 0.05% )
>    neighbourhood-last   131.20 ( +- 0.76% )    1011.36 ( +- 0.07% )
> 
> For make, utime stays relatively flat, with up to a 4.9% improvement
> in stime.

Nice evaluation!

> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>   mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 974c48db6089..d22348b95227 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
>    * @addr_hint: The address accessed by the user or the base address.
>    *
>    * Uses architectural support to clear page ranges.
> + *
> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
> + * pages in the immediate locality of the faulting page, and its left, right
> + * regions; the local neighbourhood is cleared last in order to keep cache
> + * lines of the faulting region hot.
> + *
> + * For larger folios we assume that there is no expectation of cache locality
> + * and just do a straight zero.

Just wondering: why not do the same thing here as well? Probably 
shouldn't hurt and would get rid of some code?

>    */
>   void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>   {
>   	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));

While at it you could turn that const as well.

> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> +	const int width = 2; /* number of pages cleared last on either side */

Is "width" really the right terminology? (the way you describe it, it's 
more like diameter?)

Wondering whether we should turn that into a define to make it clearer 
that we are dealing with a magic value.

Speaking of magic values, why 2 and not 3? :)

> +	struct range r[3];
> +	int i;
>   
> -	clear_contig_highpages(folio_page(folio, 0),
> -				base_addr, folio_nr_pages(folio));
> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> +		clear_contig_highpages(folio_page(folio, 0),
> +				       base_addr, folio_nr_pages(folio));
> +		return;
> +	}
> +
> +	/*
> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
> +	 * ensure it sticks around in the cache.
> +	 */
> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
> +
> +	/* Region to the left of the fault */
> +	r[1] = DEFINE_RANGE(pg.start,
> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));

"start-1" -> "start - 1" etc.

> +
> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> +			    pg.end);

Same here.

> +
> +	for (i = 0; i <= 2; i++) {

Can we use ARRAY_SIZE instead of "2" ?

> +		unsigned int npages = range_len(&r[i]);
> +		struct page *page = folio_page(folio, r[i].start);
> +		unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;

Can't you compute that from r[i].start instead? The folio_page_idx()
seems avoidable unless I am missing something.

Could make npages and addr const.

const unsigned long addr = base_addr + r[i].start * PAGE_SIZE;
const unsigned int npages = range_len(&r[i]);
struct page *page = folio_page(folio, r[i].start);

> +
> +		if (npages > 0)
> +			clear_contig_highpages(page, addr, npages);
> +	}
>   }
>   
>   static int copy_user_gigantic_page(struct folio *dst, struct folio *src,


-- 
Cheers

David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 1/8] treewide: provide a generic clear_user_page() variant
  2025-12-18  7:11   ` David Hildenbrand (Red Hat)
@ 2025-12-18 19:31     ` Ankur Arora
  0 siblings, 0 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-18 19:31 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk


David Hildenbrand (Red Hat) <david@kernel.org> writes:

> On 12/15/25 21:49, Ankur Arora wrote:
>> From: David Hildenbrand <david@redhat.com>
>> Let's drop all variants that effectively map to clear_page() and
>> provide it in a generic variant instead.
>> We'll use the macro clear_user_page to indicate whether an architecture
>> provides its own variant. Maybe at some point these should be CONFIG_
>> options.
>
> Can we drop the second sentence?

Ack.

>> Also, clear_user_page() is only called from the generic variant of
>> clear_user_highpage(), so define it only if the architecture does
>> not provide a clear_user_highpage(). And, for simplicity define it
>> in linux/highmem.h.
>> Note that for parisc, clear_page() and clear_user_page() map to
>> clear_page_asm(), so we can just get rid of the custom clear_user_page()
>> implementation. There is a clear_user_page_asm() function on parisc,
>> that seems to be unused. Not sure what's up with that.
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> You should likely now add
>
> Co-developed-by: Ankur Arora <ankur.a.arora@oracle.com>
>
> above your SB :)

:).

>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>
> Skimmed over it and nothing jumped at me.

Great!

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 4/8] highmem: do range clearing in clear_user_highpages()
  2025-12-18  7:15   ` David Hildenbrand (Red Hat)
@ 2025-12-18 20:01     ` Ankur Arora
  0 siblings, 0 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-18 20:01 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk


David Hildenbrand (Red Hat) <david@kernel.org> writes:

> On 12/15/25 21:49, Ankur Arora wrote:
>> Use the range clearing primitive clear_user_pages() when clearing
>> contiguous pages in clear_user_highpages().
>> We can safely do that when we have !CONFIG_HIGHMEM and when the
>> architecture does not have clear_user_highpage.
>> The first is necessary because not doing intermediate maps for
>> pages lets contiguous page ranges stay contiguous. The second,
>> because if the architecture has clear_user_highpage(), it likely
>> needs flushing magic when clearing the page, magic that we aren't
>> privy to.
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>
> Can't you squash #4 into #2 if you move #3 in front of them? Or is there a
> dependency with #2 that I am missing?

I thought it would be clearer to split this into #2 and #4 but #2
is completely obvious so doesn't help much.

Will squash.

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
  2025-12-18  7:36   ` David Hildenbrand (Red Hat)
@ 2025-12-18 20:16     ` Ankur Arora
  0 siblings, 0 replies; 31+ messages in thread
From: Ankur Arora @ 2025-12-18 20:16 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk


David Hildenbrand (Red Hat) <david@kernel.org> writes:

> On 12/15/25 21:49, Ankur Arora wrote:
>> Clear contiguous page ranges in folio_zero_user() instead of clearing
>> a single page at a time. Exposing larger ranges enables extent based
>> processor optimizations.
>> However, because the underlying clearing primitives do not, or might
>> not be able to, call cond_resched() to check if preemption
>> is required, limit the worst case preemption latency by doing the
>> clearing in no more than PROCESS_PAGES_NON_PREEMPT_BATCH units.
>> For architectures that define clear_pages(), we assume that the
>> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
>> worth of pages. This should be large enough to allow the processor
>> to optimize the operation and yet small enough that we see reasonable
>> preemption latency for when this optimization is not possible
>> (ex. slow microarchitectures, memory bandwidth saturation.)
>> Architectures that don't define clear_pages() will continue to use
>> the base value (single page). And, preemptible models don't need
>> invocations of cond_resched() so don't care about the batch size.
>> The resultant performance depends on the kinds of optimizations
>> available to the CPU for the region size being cleared. Two classes
>> of optimizations:
>>    - clearing iteration costs are amortized over a range larger
>>      than a single page.
>>    - cacheline allocation elision (seen on AMD Zen models).
>> Testing a demand fault workload shows an improved baseline from the
>> first optimization and a larger improvement when the region being
>> cleared is large enough for the second optimization.
>> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>>    $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>                      page-at-a-time     contiguous clearing      change
>>                    (GB/s  +- %stdev)     (GB/s  +- %stdev)
>>     pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*
>>     pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05%       +  5.2%	preempt=none|voluntary
>>     pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21% [#]   +144.3%	preempt=full|lazy
>>   [#] Notice that we perform much better with preempt=full|lazy. As
>>    mentioned above, preemptible models not needing explicit invocations
>>    of cond_resched() allow clearing of the full extent (1GB) as a
>>    single unit.
>>    In comparison the maximum extent used for preempt=none|voluntary is
>>    PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>>    The larger extent allows the processor to elide cacheline
>>    allocation (on Milan the threshold is LLC-size=32MB.)
>> Also as mentioned earlier, the baseline improvement is not specific to
>> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
>> improvement as the Milan pg-sz=2MB workload above (~30%).
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   include/linux/mm.h | 38 +++++++++++++++++++++++++++++++++++++-
>>   mm/memory.c        | 46 +++++++++++++++++++++++++---------------------
>>   2 files changed, 62 insertions(+), 22 deletions(-)
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 12106ebf1a50..45e5e0ef620c 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4194,7 +4194,6 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>>   				unsigned int order) {}
>>   #endif	/* CONFIG_DEBUG_PAGEALLOC */
>>   -#ifndef clear_pages
>
> Why is that change part of this patch?
>
> Looks like this should either go into the patch introducing clear_pages() (#3
> ?).

Yeah, I think this was a mistake. There was no need to move the #ifndef
below the comment as I do here. Will fix.

>>   /**
>>    * clear_pages() - clear a page range for kernel-internal use.
>>    * @addr: start address
>> @@ -4204,7 +4203,18 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>>    * mapped to user space.
>>    *
>>    * Does absolutely no exception handling.
>> + *
>> + * Note that even though the clearing operation is preemptible, clear_pages()
>> + * does not (and on architectures where it reduces to a few long-running
>> + * instructions, might not be able to) call cond_resched() to check if
>> + * rescheduling is required.
>> + *
>> + * When running under preemptible models this is fine, since clear_pages(),
>> + * even when reduced to long-running instructions, is preemptible.
>> + * Under cooperatively scheduled models, however, the caller is expected to
>> + * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
>>    */
>> +#ifndef clear_pages
>>   static inline void clear_pages(void *addr, unsigned int npages)
>>   {
>>   	do {
>> @@ -4214,6 +4224,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
>>   }
>>   #endif
>>   +#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
>> +#ifdef clear_pages
>> +/*
>> + * The architecture defines clear_pages(), and we assume that it is
>> + * generally "fast". So choose a batch size large enough to allow the processor
>> + * headroom for optimizing the operation and yet small enough that we see
>> + * reasonable preemption latency for when this optimization is not possible
>> + * (ex. slow microarchitectures, memory bandwidth saturation.)
>> + *
>> + * With a value of 8MB and assuming a memory bandwidth of ~10GBps, this should
>> + * result in worst case preemption latency of around 1ms when clearing pages.
>> + *
>> + * (See comment above clear_pages() for why preemption latency is a concern
>> + * here.)
>> + */
>> +#define PROCESS_PAGES_NON_PREEMPT_BATCH		(8 << (20 - PAGE_SHIFT))
>> +#else /* !clear_pages */
>> +/*
>> + * The architecture does not provide a clear_pages() implementation. Assume
>> + * that clear_page() -- which clear_pages() will fallback to -- is relatively
>> + * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
>> + */
>> +#define PROCESS_PAGES_NON_PREEMPT_BATCH		1
>> +#endif
>> +#endif
>> +
>>   #ifdef __HAVE_ARCH_GATE_AREA
>>   extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
>>   extern int in_gate_area_no_mm(unsigned long addr);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 2a55edc48a65..974c48db6089 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7237,40 +7237,44 @@ static inline int process_huge_page(
>>   	return 0;
>>   }
>>   -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> -				unsigned int nr_pages)
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> +				   unsigned int npages)
>>   {
>> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> -	int i;
>> +	unsigned int i, count, unit;
>>   -	might_sleep();
>> -	for (i = 0; i < nr_pages; i++) {
>> +	/*
>> +	 * When clearing we want to operate on the largest extent possible since
>> +	 * that allows for extent based architecture specific optimizations.
>> +	 *
>> +	 * However, since the clearing interfaces (clear_user_highpages(),
>> +	 * clear_user_pages(), clear_pages()), do not call cond_resched(), we
>> +	 * limit the batch size when running under non-preemptible scheduling
>> +	 * models.
>> +	 */
>> +	unit = preempt_model_preemptible() ? npages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>> +
>> +	for (i = 0; i < npages; i += count) {
>>   		cond_resched();
>> -		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>> +
>> +		count = min(unit, npages - i);
>> +		clear_user_highpages(page + i,
>> +				     addr + i * PAGE_SIZE, count);
>
> I guess that logic could be pushed down for the
> clear_user_highpages()->clear_pages() implementation (arch or generic)
> to take care of that, so not every user would have to care about that.

You mean the preemptibility unit stuff?

> No strong opinion as we could do that later whenever we actually get more
> clear_pages() users :)

I remember thinking about it early on. It seemed to me that clear_pages(),
clear_user_pages(), and clear_user_highpages() are all structured pretty
similarly, and all of them leave the responsibility of when to preempt to
the caller.

I guess clear_user_highpages() is special since that's the logical entry
point for user pages.

I'd like to keep it as is for now, and revisit this once clear_pages()
has a few more users.
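
For completeness, the pushed-down variant would look roughly like the
below. This is hypothetical -- not what the series does -- and the
clear_user_pages() signature is approximated:

/*
 * Hypothetical: chunk inside clear_user_highpages() itself so callers
 * need not care about preemption latency.
 *
 * PROCESS_PAGES_NON_PREEMPT_BATCH is 8MB worth of pages, i.e.
 * 8 << (20 - PAGE_SHIFT) = 2048 pages with 4K pages; at ~10 GB/s that
 * bounds the time between cond_resched() calls to around 0.8ms.
 */
static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
					unsigned int npages)
{
	unsigned int i, count;
	unsigned int unit = preempt_model_preemptible() ?
				npages : PROCESS_PAGES_NON_PREEMPT_BATCH;

	for (i = 0; i < npages; i += count) {
		cond_resched();
		count = min(unit, npages - i);
		clear_user_pages(page_address(page + i),
				 vaddr + i * PAGE_SIZE, page + i, count);
	}
}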

>> -
>>   /**
>>    * folio_zero_user - Zero a folio which will be mapped to userspace.
>>    * @folio: The folio to zero.
>> - * @addr_hint: The address will be accessed or the base address if uncelar.
>> + * @addr_hint: The address accessed by the user or the base address.
>> + *
>> + * Uses architectural support to clear page ranges.
>
> I think that comment can be dropped. Implementation detail :)

Ack.

Thanks for the reviews!

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages
  2025-12-18  7:49   ` David Hildenbrand (Red Hat)
@ 2025-12-18 21:01     ` Ankur Arora
  2025-12-18 21:23       ` Ankur Arora
  0 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-18 21:01 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, akpm, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, tglx, willy, raghavendra.kt,
	chleroy, ioworker0, boris.ostrovsky, konrad.wilk


David Hildenbrand (Red Hat) <david@kernel.org> writes:

> On 12/15/25 21:49, Ankur Arora wrote:
>> folio_zero_user() does straight zeroing without caring about
>> temporal locality for caches.
>> This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order
>> algorithm into a separate function") where we cleared a page at a
>> time converging to the faulting page from the left and the right.
>> To retain limited temporal locality, split the clearing in three
>> parts: the faulting page and its immediate neighbourhood, and, the
>> remaining regions on the left and the right. The local neighbourhood
>> will be cleared last.
>> Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
>> there isn't much expectation of cache locality for large folios.
>> Performance
>> ===
>> AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
>>    memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)
>> anon-w-seq (vm-scalability):
>>                              stime                  utime
>>    page-at-a-time      1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
>>    contiguous clearing 1602.86 ( +- 3.00% )     970.75 ( +- 4.68% )
>>    neighbourhood-last  1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )
>> Both stime and utime respond in expected ways. stime drops for both
>> contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
>> approaches. However, utime increases for both contiguous clearing
>> (+19.7%) and neighbourhood-last (+9.28%).
>> In part this is because anon-w-seq runs with 384 processes zeroing
>> anonymously mapped memory which they then access sequentially. As
>> such this is likely an uncommon pattern where the memory bandwidth
>> is saturated while also being cache limited because we access the
>> entire region.
>> Kernel make workload (make -j 12 bzImage):
>>                              stime                  utime
>>    page-at-a-time       138.16 ( +- 0.31% )    1015.11 ( +- 0.05% )
>>    contiguous clearing  133.42 ( +- 0.90% )    1013.49 ( +- 0.05% )
>>    neighbourhood-last   131.20 ( +- 0.76% )    1011.36 ( +- 0.07% )
>> For make the utime stays relatively flat with an up to 4.9% improvement
>> in the stime.
>
> Nice evaluation!
>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>>   mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 42 insertions(+), 2 deletions(-)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 974c48db6089..d22348b95227 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
>>    * @addr_hint: The address accessed by the user or the base address.
>>    *
>>    * Uses architectural support to clear page ranges.
>> + *
>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>> + * pages in the immediate locality of the faulting page, and its left, right
>> + * regions; the local neighbourhood is cleared last in order to keep cache
>> + * lines of the faulting region hot.
>> + *
>> + * For larger folios we assume that there is no expectation of cache locality
>> + * and just do a straight zero.
>
> Just wondering: why not do the same thing here as well? Probably shouldn't hurt
> and would get rid of some code?

That's a good point. With only a three way split, there's no reason to
treat large folios specially.

>>    */
>>   void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>   {
>>   	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>
> While at it you could turn that const as well.

Ack.

>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> +	const int width = 2; /* number of pages cleared last on either side */
>
> Is "width" really the right terminology? (the way you describe it, it's more
> like diameter?)

I like diameter. Will make that a define.

> Wondering whether we should turn that into a define to make it clearer that we
> are dealing with a magic value.
>
> Speaking of magic values, why 2 and not 3? :)

I think I had tried both :). The performance was pretty much the same.

But also, this is probably a function of the benchmark used. And I'm
not sure I have a good one (unless kernel build with THP counts which
isn't very responsive).

>> +	struct range r[3];
>> +	int i;
>>   -	clear_contig_highpages(folio_page(folio, 0),
>> -				base_addr, folio_nr_pages(folio));
>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> +		clear_contig_highpages(folio_page(folio, 0),
>> +				       base_addr, folio_nr_pages(folio));
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>> +	 * ensure it sticks around in the cache.
>> +	 */
>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>> +
>> +	/* Region to the left of the fault */
>> +	r[1] = DEFINE_RANGE(pg.start,
>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>
> "start-1" -> "start - 1" etc.
>
>> +
>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>> +			    pg.end);
>
> Same here.
>
>> +
>> +	for (i = 0; i <= 2; i++) {
>
> Can we use ARRAY_SIZE instead of "2" ?

Ack to all of these.

>> +		unsigned int npages = range_len(&r[i]);
>> +		struct page *page = folio_page(folio, r[i].start);
>> +		unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
>
> Can't you compute that from r[i].start) instead? The folio_page_idx() seems
> avoidable unless I am missing something.
>
> Could make npages and addr const.
>
> const unsigned long addr = base_addr + r[i].start * PAGE_SIZE;
> const unsigned int npages = range_len(&r[i]);
> struct page *page = folio_page(folio, r[i].start);

Thanks. Yeah, all of this makes sense.

Will fix.
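
With those changes folded in, the loop would end up roughly like this
(untested):

	for (i = 0; i < ARRAY_SIZE(r); i++) {
		/* r[i].start already indexes into the folio, so no folio_page_idx(). */
		const unsigned long addr = base_addr + r[i].start * PAGE_SIZE;
		const unsigned int npages = range_len(&r[i]);
		struct page *page = folio_page(folio, r[i].start);

		if (npages > 0)
			clear_contig_highpages(page, addr, npages);
	}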

>> +
>> +		if (npages > 0)
>> +			clear_contig_highpages(page, addr, npages);
>> +	}
>>   }
>>     static int copy_user_gigantic_page(struct folio *dst, struct folio *src,

Thanks for the review!

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages
  2025-12-18 21:01     ` Ankur Arora
@ 2025-12-18 21:23       ` Ankur Arora
  2025-12-23 10:11         ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 31+ messages in thread
From: Ankur Arora @ 2025-12-18 21:23 UTC (permalink / raw)
  To: Ankur Arora
  Cc: David Hildenbrand (Red Hat),
	linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, tglx, willy, raghavendra.kt, chleroy,
	ioworker0, boris.ostrovsky, konrad.wilk


Ankur Arora <ankur.a.arora@oracle.com> writes:

> David Hildenbrand (Red Hat) <david@kernel.org> writes:
>
>> On 12/15/25 21:49, Ankur Arora wrote:
>>> folio_zero_user() does straight zeroing without caring about
>>> temporal locality for caches.
>>> This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order
>>> algorithm into a separate function") where we cleared a page at a
>>> time converging to the faulting page from the left and the right.
>>> To retain limited temporal locality, split the clearing in three
>>> parts: the faulting page and its immediate neighbourhood, and, the
>>> remaining regions on the left and the right. The local neighbourhood
>>> will be cleared last.
>>> Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
>>> there isn't much expectation of cache locality for large folios.
>>> Performance
>>> ===
>>> AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
>>>    memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)
>>> anon-w-seq (vm-scalability):
>>>                              stime                  utime
>>>    page-at-a-time      1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
>>>    contiguous clearing 1602.86 ( +- 3.00% )     970.75 ( +- 4.68% )
>>>    neighbourhood-last  1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )
>>> Both stime and utime respond in expected ways. stime drops for both
>>> contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
>>> approaches. However, utime increases for both contiguous clearing
>>> (+19.7%) and neighbourhood-last (+9.28%).
>>> In part this is because anon-w-seq runs with 384 processes zeroing
>>> anonymously mapped memory which they then access sequentially. As
>>> such this is likely an uncommon pattern where the memory bandwidth
>>> is saturated while also being cache limited because we access the
>>> entire region.
>>> Kernel make workload (make -j 12 bzImage):
>>>                              stime                  utime
>>>    page-at-a-time       138.16 ( +- 0.31% )    1015.11 ( +- 0.05% )
>>>    contiguous clearing  133.42 ( +- 0.90% )    1013.49 ( +- 0.05% )
>>>    neighbourhood-last   131.20 ( +- 0.76% )    1011.36 ( +- 0.07% )
>>> For make the utime stays relatively flat with an up to 4.9% improvement
>>> in the stime.
>>
>> Nice evaluation!
>>
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>>> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
>>> ---
>>>   mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>>   1 file changed, 42 insertions(+), 2 deletions(-)
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 974c48db6089..d22348b95227 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
>>>    * @addr_hint: The address accessed by the user or the base address.
>>>    *
>>>    * Uses architectural support to clear page ranges.
>>> + *
>>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>>> + * pages in the immediate locality of the faulting page, and its left, right
>>> + * regions; the local neighbourhood is cleared last in order to keep cache
>>> + * lines of the faulting region hot.
>>> + *
>>> + * For larger folios we assume that there is no expectation of cache locality
>>> + * and just do a straight zero.
>>
>> Just wondering: why not do the same thing here as well? Probably shouldn't hurt
>> and would get rid of some code?
>
> That's a good point. With only a three way split, there's no reason to
> treat large folios specially.

A bit more on this: the change makes sense, but I'll retain the current
split between patches 7 and 8.

Patch 7 justifies contiguous clearing (and the choice of value for
PROCESS_PAGES_NON_PREEMPT_BATCH, the unit based on the preemption model,
etc.), while patch 8 covers the neighbourhood optimization.

>>>    */
>>>   void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>   {
>>>   	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>>
>> While at it you could turn that const as well.
>
> Ack.
>
>>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>>> +	const int width = 2; /* number of pages cleared last on either side */
>>
>> Is "width" really the right terminology? (the way you describe it, it's more
>> like diameter?)
>
> I like diameter. Will make that a define.

I'll make that radius since that's how I'm using it.
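
Roughly like this (sketch only; the define name is provisional):

/* Pages on either side of the faulting page that are cleared last. */
#define CLEAR_PAGES_LAST_RADIUS	2

	/* Faulting page plus its neighbourhood, cleared last to stay cache hot. */
	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - CLEAR_PAGES_LAST_RADIUS,
				    pg.start, pg.end),
			    clamp_t(s64, fault_idx + CLEAR_PAGES_LAST_RADIUS,
				    pg.start, pg.end));

	/* Region to the left of the neighbourhood. */
	r[1] = DEFINE_RANGE(pg.start,
			    clamp_t(s64, r[2].start - 1, pg.start - 1, r[2].start));

	/* Region to the right; always valid for the common fault_idx=0 case. */
	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end + 1, r[2].end, pg.end + 1),
			    pg.end);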

Thanks
Ankur

>> Wondering whether we should turn that into a define to make it clearer that we
>> are dealing with a magic value.
>>
>> Speaking of magic values, why 2 and not 3? :)
>
> I think I had tried both :). The performance was pretty much the same.
>
> But also, this is probably a function of the benchmark used. And I'm
> not sure I have a good one (unless kernel build with THP counts which
> isn't very responsive).
>
>>> +	struct range r[3];
>>> +	int i;
>>>   -	clear_contig_highpages(folio_page(folio, 0),
>>> -				base_addr, folio_nr_pages(folio));
>>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>>> +		clear_contig_highpages(folio_page(folio, 0),
>>> +				       base_addr, folio_nr_pages(folio));
>>> +		return;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>>> +	 * ensure it sticks around in the cache.
>>> +	 */
>>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>>> +
>>> +	/* Region to the left of the fault */
>>> +	r[1] = DEFINE_RANGE(pg.start,
>>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>>
>> "start-1" -> "start - 1" etc.
>>
>>> +
>>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>>> +			    pg.end);
>>
>> Same here.
>>
>>> +
>>> +	for (i = 0; i <= 2; i++) {
>>
>> Can we use ARRAY_SIZE instead of "2" ?
>
> Ack to all of these.
>
>>> +		unsigned int npages = range_len(&r[i]);
>>> +		struct page *page = folio_page(folio, r[i].start);
>>> +		unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
>>
>> Can't you compute that from r[i].start) instead? The folio_page_idx() seems
>> avoidable unless I am missing something.
>>
>> Could make npages and addr const.
>>
>> const unsigned long addr = base_addr + r[i].start * PAGE_SIZE;
>> const unsigned int npages = range_len(&r[i]);
>> struct page *page = folio_page(folio, r[i].start);
>
> Thanks. Yeah, all of this makes sense.
>
> Will fix.
>
>>> +
>>> +		if (npages > 0)
>>> +			clear_contig_highpages(page, addr, npages);
>>> +	}
>>>   }
>>>     static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>
> Thanks for the review!

--
ankur


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages
  2025-12-18 21:23       ` Ankur Arora
@ 2025-12-23 10:11         ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-23 10:11 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, akpm, bp, dave.hansen, hpa, mingo,
	mjguzik, luto, peterz, tglx, willy, raghavendra.kt, chleroy,
	ioworker0, boris.ostrovsky, konrad.wilk

On 12/18/25 22:23, Ankur Arora wrote:
> 
> Ankur Arora <ankur.a.arora@oracle.com> writes:
> 
>> David Hildenbrand (Red Hat) <david@kernel.org> writes:
>>
>>> On 12/15/25 21:49, Ankur Arora wrote:
>>>> folio_zero_user() does straight zeroing without caring about
>>>> temporal locality for caches.
>>>> This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order
>>>> algorithm into a separate function") where we cleared a page at a
>>>> time converging to the faulting page from the left and the right.
>>>> To retain limited temporal locality, split the clearing in three
>>>> parts: the faulting page and its immediate neighbourhood, and, the
>>>> remaining regions on the left and the right. The local neighbourhood
>>>> will be cleared last.
>>>> Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
>>>> there isn't much expectation of cache locality for large folios.
>>>> Performance
>>>> ===
>>>> AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
>>>>     memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)
>>>> anon-w-seq (vm-scalability):
>>>>                               stime                  utime
>>>>     page-at-a-time      1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
>>>>     contiguous clearing 1602.86 ( +- 3.00% )     970.75 ( +- 4.68% )
>>>>     neighbourhood-last  1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )
>>>> Both stime and utime respond in expected ways. stime drops for both
>>>> contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
>>>> approaches. However, utime increases for both contiguous clearing
>>>> (+19.7%) and neighbourhood-last (+9.28%).
>>>> In part this is because anon-w-seq runs with 384 processes zeroing
>>>> anonymously mapped memory which they then access sequentially. As
>>>> such this is likely an uncommon pattern where the memory bandwidth
>>>> is saturated while also being cache limited because we access the
>>>> entire region.
>>>> Kernel make workload (make -j 12 bzImage):
>>>>                               stime                  utime
>>>>     page-at-a-time       138.16 ( +- 0.31% )    1015.11 ( +- 0.05% )
>>>>     contiguous clearing  133.42 ( +- 0.90% )    1013.49 ( +- 0.05% )
>>>>     neighbourhood-last   131.20 ( +- 0.76% )    1011.36 ( +- 0.07% )
>>>> For make the utime stays relatively flat with an up to 4.9% improvement
>>>> in the stime.
>>>
>>> Nice evaluation!
>>>
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>>>> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
>>>> ---
>>>>    mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>>>    1 file changed, 42 insertions(+), 2 deletions(-)
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 974c48db6089..d22348b95227 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
>>>>     * @addr_hint: The address accessed by the user or the base address.
>>>>     *
>>>>     * Uses architectural support to clear page ranges.
>>>> + *
>>>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>>>> + * pages in the immediate locality of the faulting page, and its left, right
>>>> + * regions; the local neighbourhood is cleared last in order to keep cache
>>>> + * lines of the faulting region hot.
>>>> + *
>>>> + * For larger folios we assume that there is no expectation of cache locality
>>>> + * and just do a straight zero.
>>>
>>> Just wondering: why not do the same thing here as well? Probably shouldn't hurt
>>> and would get rid of some code?
>>
>> That's a good point. With only a three way split, there's no reason to
>> treat large folios specially.
> 
> A bit more on this: this change makes sense but I'll retain the current
> split between patches-7, 8.
> 
> Where patch-7, is used to justify using contiguous clearing (and the
> choice of value for PROCESS_PAGES_NON_PREEMPT_BATCH), unit based on
> preemption model etc and patch-8, for the neighbourhood optimization.
> 
>>>>     */
>>>>    void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>>>    {
>>>>    	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>>>
>>> While at it you could turn that const as well.
>>
>> Ack.
>>
>>>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>>>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>>>> +	const int width = 2; /* number of pages cleared last on either side */
>>>
>>> Is "width" really the right terminology? (the way you describe it, it's more
>>> like diameter?)
>>
>> I like diameter. Will make that a define.
> 
> I'll make that radius since that's how I'm using it.

All makes sense to me.

-- 
Cheers

David


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2025-12-23 10:11 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-15 20:49 [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Ankur Arora
2025-12-15 20:49 ` [PATCH v10 1/8] treewide: provide a generic clear_user_page() variant Ankur Arora
2025-12-18  7:11   ` David Hildenbrand (Red Hat)
2025-12-18 19:31     ` Ankur Arora
2025-12-15 20:49 ` [PATCH v10 2/8] highmem: introduce clear_user_highpages() Ankur Arora
2025-12-15 20:49 ` [PATCH v10 3/8] mm: introduce clear_pages() and clear_user_pages() Ankur Arora
2025-12-15 20:49 ` [PATCH v10 4/8] highmem: do range clearing in clear_user_highpages() Ankur Arora
2025-12-18  7:15   ` David Hildenbrand (Red Hat)
2025-12-18 20:01     ` Ankur Arora
2025-12-15 20:49 ` [PATCH v10 5/8] x86/mm: Simplify clear_page_* Ankur Arora
2025-12-15 20:49 ` [PATCH v10 6/8] x86/clear_page: Introduce clear_pages() Ankur Arora
2025-12-18  7:22   ` David Hildenbrand (Red Hat)
2025-12-15 20:49 ` [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges Ankur Arora
2025-12-16  2:44   ` Andrew Morton
2025-12-16  6:49     ` Ankur Arora
2025-12-16 15:12       ` Andrew Morton
2025-12-17  8:48         ` Ankur Arora
2025-12-17 18:54           ` Andrew Morton
2025-12-17 19:51             ` Ankur Arora
2025-12-17 20:26               ` Andrew Morton
2025-12-18  0:51                 ` Ankur Arora
2025-12-18  7:36   ` David Hildenbrand (Red Hat)
2025-12-18 20:16     ` Ankur Arora
2025-12-15 20:49 ` [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages Ankur Arora
2025-12-18  7:49   ` David Hildenbrand (Red Hat)
2025-12-18 21:01     ` Ankur Arora
2025-12-18 21:23       ` Ankur Arora
2025-12-23 10:11         ` David Hildenbrand (Red Hat)
2025-12-16  2:48 ` [PATCH v10 0/8] mm: folio_zero_user: clear contiguous pages Andrew Morton
2025-12-16  5:04   ` Ankur Arora
2025-12-18  7:38     ` David Hildenbrand (Red Hat)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox