From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: akpm@linux-foundation.org, david@kernel.org, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
	mjguzik@gmail.com, luto@kernel.org, peterz@infradead.org,
	tglx@linutronix.de, willy@infradead.org, raghavendra.kt@amd.com,
	chleroy@kernel.org, ioworker0@gmail.com, lizhe.67@bytedance.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
	ankur.a.arora@oracle.com
Subject: [PATCH v11 7/8] mm: folio_zero_user: clear page ranges
Date: Tue,  6 Jan 2026 23:20:08 -0800	[thread overview]
Message-ID: <20260107072009.1615991-8-ankur.a.arora@oracle.com> (raw)
In-Reply-To: <20260107072009.1615991-1-ankur.a.arora@oracle.com>

Use batch clearing in clear_contig_highpages() instead of clearing a
single page at a time. Exposing larger ranges enables the processor to
optimize based on extent.

To do this, switch to clear_user_highpages(), which in turn uses
clear_user_pages() or clear_pages().

Batched clearing, however, has latency implications when running under
non-preemptible models: we need periodic invocations of cond_resched()
to keep preemption latencies reasonable, but the clearing primitives do
not (and on some architectures might not be able to) call cond_resched()
to check whether preemption is needed.

So, limit the worst case preemption latency by doing the clearing in
units of no more than PROCESS_PAGES_NON_PREEMPT_BATCH pages.
(Preemptible models already define away most of cond_resched(), so the
batch size is ignored when running under those.)

PROCESS_PAGES_NON_PREEMPT_BATCH: for architectures with "fast" page
clearing (ones that define clear_pages()), we define it as 32MB worth
of pages. This is meant to be large enough to allow the processor to
optimize the operation and yet small enough that we see reasonable
preemption latency when this optimization is not possible
(e.g. slow microarchitectures, memory bandwidth saturation).
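
(With the common PAGE_SHIFT=12, i.e. 4KB base pages, 32MB corresponds
to 32 << (20 - 12) = 8192 pages per batch.)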

This specific value also allows for a cacheline allocation elision
optimization (which might help unrelated applications by not evicting
potentially useful cache lines) that kicks in on recent generations of
AMD Zen processors at around LLC size (32MB is a typical size).

At the same time, 32MB is small enough that even with poor clearing
bandwidth (say ~10GBps), the time to clear a batch stays well below the
scheduler's default warning threshold (sysctl_resched_latency_warn_ms=100).
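
(Roughly: 32MB / ~10GBps comes to about 3.2ms per batch, leaving a
 comfortable margin under the 100ms threshold even for slower clearing.)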

"Slow" architectures (don't have clear_pages()) will continue to use
the base value (single page).

Performance
==

Testing a demand fault workload shows a decent improvement in bandwidth
with pg-sz=1GB. Bandwidth with pg-sz=2MB stays flat.

 $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

                   contiguous-pages       batched-pages
                   (GBps +- %stdev)      (GBps +- %stdev)

   pg-sz=2MB       23.58 +- 1.95%        25.34 +- 1.18%       +  7.50%  preempt=*

   pg-sz=1GB       25.09 +- 0.79%        39.22 +- 2.32%       + 56.31%  preempt=none|voluntary
   pg-sz=1GB       25.71 +- 0.03%        52.73 +- 0.20% [#]   +110.16%  preempt=full|lazy

 [#] We perform much better with preempt=full|lazy because, not
  needing explicit invocations of cond_resched(), we can clear the
  full extent (pg-sz=1GB) as a single unit, which the processor
  can optimize for.

 (Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13);
  region-size=64GB, local node; 2.56 GHz, boost=0.)

Analysis
==

pg-sz=1GB: the improvement we see falls into two buckets depending on
the batch size in use.

For batch-size=32MB, the number of cachelines allocated (L1-dcache-loads)
-- which stays relatively flat for smaller batches -- starts to drop off
because cacheline allocation elision kicks in. And, as can be seen below,
at batch-size=1GB we stop allocating cachelines almost entirely.
(Not visible here, but testing with intermediate sizes shows that the
allocation change kicks in only at batch-size=32MB and ramps up from
there.)

 contiguous-pages      6,949,417,798      L1-dcache-loads                  #  883.599 M/sec                       ( +-  0.01% )  (35.75%)
                       3,226,709,573      L1-dcache-load-misses            #   46.43% of all L1-dcache accesses   ( +-  0.05% )  (35.75%)

    batched,32MB       2,290,365,772      L1-dcache-loads                  #  471.171 M/sec                       ( +-  0.36% )  (35.72%)
                       1,144,426,272      L1-dcache-load-misses            #   49.97% of all L1-dcache accesses   ( +-  0.58% )  (35.70%)

    batched,1GB           63,914,157      L1-dcache-loads                  #   17.464 M/sec                       ( +-  8.08% )  (35.73%)
                          22,074,367      L1-dcache-load-misses            #   34.54% of all L1-dcache accesses   ( +- 16.70% )  (35.70%)

The dropoff is also visible in L2 prefetch hits (the miss numbers follow
a similar pattern):

 contiguous-pages      3,464,861,312      l2_pf_hit_l2.all                 #  437.722 M/sec                       ( +-  0.74% )  (15.69%)

   batched,32MB          883,750,087      l2_pf_hit_l2.all                 #  181.223 M/sec                       ( +-  1.18% )  (15.71%)

    batched,1GB            8,967,943      l2_pf_hit_l2.all                 #    2.450 M/sec                       ( +- 17.92% )  (15.77%)

This largely decouples the frontend from the backend since the clearing
operation does not need to wait on loads from memory (we still need
cacheline ownership, but that's a shorter path). This is most visible
if we rerun the test above with boost=1 (3.66 GHz).

 $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

                   contiguous-pages       batched-pages
                   (GBps +- %stdev)      (GBps +- %stdev)

   pg-sz=2MB       26.08 +- 1.72%        26.13 +- 0.92%           -     preempt=*            

   pg-sz=1GB       26.99 +- 0.62%        48.85 +- 2.19%       + 80.99%  preempt=none|voluntary
   pg-sz=1GB       27.69 +- 0.18%        75.18 +- 0.25%       +171.50%  preempt=full|lazy     

Comparing these batched-pages numbers with the boost=0 ones: for a
clock-speed gain of 42% we gain 24.5% for batch-size=32MB and 42.5%
for batch-size=1GB.
In comparison, the baseline contiguous-pages case and both pg-sz=2MB
cases are largely backend bound and so gain no more than ~10%.

Other platforms tested, Intel Icelakex (Oracle X9) and ARM64 Neoverse-N1
(Ampere Altra), both show an improvement of ~35% for pg-sz=2MB|1GB.
The first goes from around 8GBps to 11GBps and the second from 32GBps
to 44GBps.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/linux/mm.h | 36 ++++++++++++++++++++++++++++++++++++
 mm/memory.c        | 18 +++++++++++++++---
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a4a9a8d1ffec..fb5b86d78093 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4204,6 +4204,15 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
  * mapped to user space.
  *
  * Does absolutely no exception handling.
+ *
+ * Note that even though the clearing operation is preemptible, clear_pages()
+ * does not (and on architectures where it reduces to a few long-running
+ * instructions, might not be able to) call cond_resched() to check if
+ * rescheduling is required.
+ *
+ * When running under preemptible models this is not a problem. Under
+ * cooperatively scheduled models, however, the caller is expected to
+ * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
  */
 static inline void clear_pages(void *addr, unsigned int npages)
 {
@@ -4214,6 +4224,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
 }
 #endif
 
+#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
+#ifdef clear_pages
+/*
+ * The architecture defines clear_pages(), and we assume that it is
+ * generally "fast". So choose a batch size large enough to allow the processor
+ * headroom for optimizing the operation and yet small enough that we see
+ * reasonable preemption latency for when this optimization is not possible
+ * (ex. slow microarchitectures, memory bandwidth saturation.)
+ *
+ * With a value of 32MB and assuming a memory bandwidth of ~10GBps, this should
+ * result in worst case preemption latency of around 3ms when clearing pages.
+ *
+ * (See comment above clear_pages() for why preemption latency is a concern
+ * here.)
+ */
+#define PROCESS_PAGES_NON_PREEMPT_BATCH		(32 << (20 - PAGE_SHIFT))
+#else /* !clear_pages */
+/*
+ * The architecture does not provide a clear_pages() implementation. Assume
+ * that clear_page() -- which clear_pages() will fallback to -- is relatively
+ * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
+ */
+#define PROCESS_PAGES_NON_PREEMPT_BATCH		1
+#endif
+#endif
+
 #ifdef __HAVE_ARCH_GATE_AREA
 extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
 extern int in_gate_area_no_mm(unsigned long addr);
diff --git a/mm/memory.c b/mm/memory.c
index c06e43a8861a..49e7154121f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7240,13 +7240,25 @@ static inline int process_huge_page(
 static void clear_contig_highpages(struct page *page, unsigned long addr,
 				   unsigned int nr_pages)
 {
-	unsigned int i;
+	unsigned int i, unit, count;
 
 	might_sleep();
-	for (i = 0; i < nr_pages; i++) {
+	/*
+	 * When clearing we want to operate on the largest extent possible since
+	 * that allows for extent based architecture specific optimizations.
+	 *
+	 * However, since the clearing interfaces (clear_user_highpages(),
+	 * clear_user_pages(), clear_pages()), do not call cond_resched(), we
+	 * limit the batch size when running under non-preemptible scheduling
+	 * models.
+	 */
+	unit = preempt_model_preemptible() ? nr_pages : PROCESS_PAGES_NON_PREEMPT_BATCH;
+
+	for (i = 0; i < nr_pages; i += count) {
 		cond_resched();
 
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+		count = min(unit, nr_pages - i);
+		clear_user_highpages(page + i, addr + i * PAGE_SIZE, count);
 	}
 }
 
-- 
2.31.1



