From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: akpm@linux-foundation.org, david@kernel.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
mjguzik@gmail.com, luto@kernel.org, peterz@infradead.org,
tglx@linutronix.de, willy@infradead.org, raghavendra.kt@amd.com,
boris.ostrovsky@oracle.com, konrad.wilk@oracle.com,
ankur.a.arora@oracle.com
Subject: [PATCH v9 6/7] mm, folio_zero_user: support clearing page ranges
Date: Fri, 21 Nov 2025 12:23:51 -0800
Message-ID: <20251121202352.494700-7-ankur.a.arora@oracle.com>
In-Reply-To: <20251121202352.494700-1-ankur.a.arora@oracle.com>

Clear contiguous page ranges in folio_zero_user() instead of clearing
a single page at a time. Exposing larger ranges enables extent-based
processor optimizations.

However, because the underlying clearing primitives do not (and on
some architectures might not be able to) call cond_resched() to check
whether preemption is required, limit the worst-case preemption
latency by doing the clearing in units of no more than
PROCESS_PAGES_NON_PREEMPT_BATCH pages.
For architectures that define clear_pages(), we assume that the
clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
worth of pages. This should be large enough to allow the processor
to optimize the operation and yet small enough that we see reasonable
preemption latency when this optimization is not possible
(ex. slow microarchitectures, memory bandwidth saturation).

Architectures that don't define clear_pages() will continue to use
the base value (a single page). Preemptible models don't need explicit
invocations of cond_resched(), so the batch size does not matter for
them.
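
In outline, the resulting clearing loop looks as below (this mirrors
clear_contig_highpages() in the patch; the batch arithmetic assumes
PAGE_SHIFT=12):

	/*
	 * Clear npages in units of at most
	 * PROCESS_PAGES_NON_PREEMPT_BATCH pages, checking for preemption
	 * between batches. With 4K pages that is 8 << (20 - 12) = 2048
	 * pages, i.e. 8MB, per batch.
	 */
	unsigned int i, count, unit;

	/* Preemptible models can take the whole extent in one call. */
	unit = preempt_model_preemptible() ? npages :
					     PROCESS_PAGES_NON_PREEMPT_BATCH;

	for (i = 0; i < npages; i += count) {
		cond_resched();

		count = min(unit, npages - i);
		clear_user_highpages(page + i, addr + i * PAGE_SIZE, count);
	}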
The resultant performance depends on the kinds of optimizations
available to the CPU for the region size being cleared. There are two
classes of optimizations:

- clearing iteration costs amortized over a range larger than a
  single page.
- cacheline allocation elision (seen on AMD Zen models).

Testing a demand fault workload shows an improved baseline from the
first optimization, and a larger improvement when the region being
cleared is large enough for the second optimization.
AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

  $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

              page-at-a-time     contiguous clearing          change
             (GB/s +- %stdev)     (GB/s +- %stdev)

  pg-sz=2MB   12.92 +- 2.55%      17.03 +- 0.70%       + 31.8%  preempt=*
  pg-sz=1GB   17.14 +- 2.27%      18.04 +- 1.05%       +  5.2%  preempt=none|voluntary
  pg-sz=1GB   17.26 +- 1.24%      42.17 +- 4.21% [#]   +144.3%  preempt=full|lazy
[#] Notice that we perform much better with preempt=full|lazy. As
mentioned above, preemptible models do not need explicit invocations
of cond_resched(), which allows us to clear the full extent (1GB) in
a single unit. In comparison, the maximum extent used for
preempt=none|voluntary is PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).

The larger extent allows the processor to elide cacheline allocation
(on Milan the threshold is LLC-size=32MB).
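
Putting rough numbers on this (using the ~10GBps bandwidth assumed in
the comment in the patch):

	8MB / ~10GB/s ~= 0.8ms   worst-case stretch between cond_resched() calls
	8MB < LLC (32MB)         too small to elide cacheline allocation
	1GB > LLC (32MB)         large enough to elide cacheline allocation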
Also, as mentioned earlier, the baseline improvement is not specific
to AMD Zen platforms: Intel Icelakex (pg-sz=2MB|1GB) sees an
improvement similar to the Milan pg-sz=2MB workload above (~30%).
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
---
Notes:
- Define PROCESS_PAGES_NON_PREEMPT_BATCH in common code (instead of
  inheriting ARCH_PAGE_CONTIG_NR).
- Also document this in much greater detail, as clearing pages
  needing a constant dependent on the preemption model is facially
  quite odd.
- Minor loop reorganization in clear_contig_highpages().
include/linux/mm.h | 35 +++++++++++++++++++++++++++++++++++
mm/memory.c | 46 +++++++++++++++++++++++++---------------------
2 files changed, 60 insertions(+), 21 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c397ee2f6dd5..ac20d2022cdf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3889,6 +3889,15 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
* mapped to user space.
*
* Does absolutely no exception handling.
+ *
+ * Note that even though the clearing operation is preemptible, clear_pages()
+ * does not (and on architectures where it reduces to a few long-running
+ * but preemptible instructions, might not be able to) call cond_resched()
+ * to check if rescheduling is required.
+ *
+ * Running under preemptible models this is not a problem. Under cooperatively
+ * scheduled preemption models, however, the caller is expected to limit
+ * @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
*/
static inline void clear_pages(void *addr, unsigned int npages)
{
@@ -3899,6 +3908,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
}
#endif
+#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
+#ifdef clear_pages
+/*
+ * The architecture defines clear_pages(), and we assume that it is
+ * generally "fast". So choose a batch size large enough to allow the processor
+ * headroom for optimizing the operation and yet small enough that we see
+ * reasonable preemption latency for when this optimization is not possible
+ * (ex. slow microarchitectures, memory bandwidth saturation.)
+ *
+ * With a value of 8MB and assuming a memory bandwidth of ~10GBps, this should
+ * result in worst case preemption latency of around 1ms when clearing pages.
+ *
+ * (See comment above clear_pages() for why preemption latency is a concern
+ * here.)
+ */
+#define PROCESS_PAGES_NON_PREEMPT_BATCH (8 << (20 - PAGE_SHIFT))
+#else /* !clear_pages */
+/*
+ * The architecture does not provide a clear_pages() implementation. Assume
+ * that clear_page() -- which clear_pages() will fall back to -- is relatively
+ * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
+ */
+#define PROCESS_PAGES_NON_PREEMPT_BATCH 1
+#endif
+#endif
+
#ifndef clear_user_page
/**
* clear_user_page() - clear a page to be mapped to user space
diff --git a/mm/memory.c b/mm/memory.c
index b59ae7ce42eb..5e78af316647 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7162,40 +7162,44 @@ static inline int process_huge_page(
return 0;
}
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
- unsigned int nr_pages)
+static void clear_contig_highpages(struct page *page, unsigned long addr,
+ unsigned int npages)
{
- unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
- int i;
+ unsigned int i, count, unit;
- might_sleep();
- for (i = 0; i < nr_pages; i++) {
+ /*
+ * When clearing we want to operate on the largest extent possible since
+ * that allows for extent based architecture specific optimizations.
+ *
+ * However, since the clearing interfaces (clear_user_highpages(),
+ * clear_user_pages(), clear_pages()), do not call cond_resched(), we
+ * limit the batch size when running under non-preemptible scheduling
+ * models.
+ */
+ unit = preempt_model_preemptible() ? npages : PROCESS_PAGES_NON_PREEMPT_BATCH;
+
+ for (i = 0; i < npages; i += count) {
cond_resched();
- clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
+
+ count = min(unit, npages - i);
+ clear_user_highpages(page + i,
+ addr + i * PAGE_SIZE, count);
}
}
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
- struct folio *folio = arg;
-
- clear_user_highpage(folio_page(folio, idx), addr);
- return 0;
-}
-
/**
* folio_zero_user - Zero a folio which will be mapped to userspace.
* @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support to clear page ranges.
*/
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
- unsigned int nr_pages = folio_nr_pages(folio);
+ unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
- clear_gigantic_page(folio, addr_hint, nr_pages);
- else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ clear_contig_highpages(folio_page(folio, 0),
+ base_addr, folio_nr_pages(folio));
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
2.31.1