From: Ankur Arora <ankur.a.arora@oracle.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: akpm@linux-foundation.org, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
	mjguzik@gmail.com, luto@kernel.org, peterz@infradead.org,
	acme@kernel.org, namhyung@kernel.org, tglx@linutronix.de,
	willy@infradead.org, jon.grimm@amd.com, bharata@amd.com,
	raghavendra.kt@amd.com, boris.ostrovsky@oracle.com,
	konrad.wilk@oracle.com, ankur.a.arora@oracle.com
Subject: [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing
Date: Sun, 15 Jun 2025 22:22:23 -0700
Message-ID: <20250616052223.723982-14-ankur.a.arora@oracle.com>
In-Reply-To: <20250616052223.723982-1-ankur.a.arora@oracle.com>

Override the common code version of folio_zero_user() so we can use
clear_pages() to do multi-page clearing instead of the standard
page-at-a-time clearing. This allows us to advertise the full
region size to the processor, which, when using string instructions
(REP; STOS), can use its knowledge of the extent to optimize the
clearing.
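
To illustrate the point, here is a minimal sketch of what the REP; STOS
path boils down to (the function name is made up for illustration; the
actual primitive is the clear_pages()/clear_page() code added earlier in
this series):

  static inline void clear_extent_rep_stos(void *addr, unsigned long npages)
  {
          unsigned long qwords = npages * PAGE_SIZE / 8;

          /*
           * Hand the full extent to a single REP; STOSQ: RCX carries the
           * whole count (in qwords), letting the CPU pick its own strategy
           * for the clear instead of seeing one page at a time.
           */
          asm volatile("rep stosq"
                       : "+c" (qwords), "+D" (addr)
                       : "a" (0UL)
                       : "memory");
  }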

Apart from this, we have two other considerations: cache locality when
clearing 2MB pages, and preemption latency when clearing GB pages.

The first is handled by breaking the clearing into three parts: the
faulting page and its immediate neighbourhood, and the regions to its
left and right; the local neighbourhood is cleared last.
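
For example (illustrative numbers, traced through the code below): for a
2MB folio of 512 base pages faulting on page 100, the three regions work
out to [103, 511] (right of the fault), [0, 97] (left of the fault) and
[98, 102] (the faulting page plus two pages on either side), cleared in
that order so the cache lines of the last region stay hot.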

The second is only an issue for kernels running under cooperative
preemption. Limit the worst case preemption latency by clearing in
PAGE_RESCHED_CHUNK (8MB) units.
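
(At the ~10GB/s clearing bandwidth that the code below assumes, an 8MB
chunk takes about 8MB / 10GB/s ~= 0.8ms, bounding the worst case resched
latency at around 1ms.)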

The resultant performance depends on the kinds of optimizations that the
uarch can do for the clearing extent. Two classes of optimizations:

  - amortize each clearing iteration over a large range instead of at
    a page granularity.
  - cacheline allocation elision (seen only on AMD Zen models)

A demand fault workload shows that the resultant performance falls in
two buckets depending on whether the extent being zeroed is large enough
to allow for cacheline allocation elision.

AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

 $ perf bench mem map -p $page-size -f demand -s 64GB -l 5

                 mm/folio_zero_user    x86/folio_zero_user       change
                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

  pg-sz=2MB       11.82  +- 0.67%        16.48  +-  0.30%       + 39.4%
  pg-sz=1GB       17.51  +- 1.19%        40.03  +-  7.26% [#]   +129.9%

[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than PAGE_RESCHED_CHUNK, so
preempt=none|voluntary sees no improvement for this test:

  pg-sz=1GB       17.14  +- 1.39%        17.42  +-  0.98%        + 1.6%

The dropoff in cacheline allocations for pg-sz=1GB can be seen with
perf-stat:

   - 44,513,459,667      cycles                           #    2.420 GHz                         ( +-  0.44% )  (35.71%)
   -  1,378,032,592      instructions                     #    0.03  insn per cycle
   - 11,224,288,082      L1-dcache-loads                  #  610.187 M/sec                       ( +-  0.08% )  (35.72%)
   -  5,373,473,118      L1-dcache-load-misses            #   47.87% of all L1-dcache accesses   ( +-  0.00% )  (35.71%)

   + 20,093,219,076      cycles                           #    2.421 GHz                         ( +-  3.64% )  (35.69%)
   +  1,378,032,592      instructions                     #    0.03  insn per cycle
   +    186,525,095      L1-dcache-loads                  #   22.479 M/sec                       ( +-  2.11% )  (35.74%)
   +     73,479,687      L1-dcache-load-misses            #   39.39% of all L1-dcache accesses   ( +-  3.03% )  (35.74%)

Also note that, as mentioned earlier, this improvement is not specific
to AMD Zen*. Intel Icelakex (pg-sz=2MB|1GB) sees a similar improvement
to the Milan pg-sz=2MB workload above (~35%).

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/mm/Makefile |  1 +
 arch/x86/mm/memory.c | 97 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 arch/x86/mm/memory.c

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 5b9908f13dcf..9031faf21849 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
 obj-$(CONFIG_NUMA)		+= numa.o
 obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat.o
+obj-$(CONFIG_PREEMPTION)	+= memory.o
 
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
new file mode 100644
index 000000000000..a799c0cc3c5f
--- /dev/null
+++ b/arch/x86/mm/memory.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/mm.h>
+#include <linux/range.h>
+#include <linux/minmax.h>
+
+/*
+ * Limit the optimized version of folio_zero_user() to !CONFIG_HIGHMEM:
+ * clear_pages() works on a contiguous kernel mapping of the pages, which
+ * is not guaranteed to exist under HIGHMEM.
+ */
+#ifndef CONFIG_HIGHMEM
+/*
+ * For voluntary preemption models, operate with a max chunk-size of 8MB.
+ * (Worst case resched latency of ~1ms, with a clearing BW of ~10GBps.)
+ */
+#define PAGE_RESCHED_CHUNK	(8 << (20 - PAGE_SHIFT))
+
+static void clear_pages_resched(void *addr, int npages)
+{
+	int i, remaining;
+
+	if (preempt_model_preemptible()) {
+		clear_pages(addr, npages);
+		goto out;
+	}
+
+	for (i = 0; i < npages/PAGE_RESCHED_CHUNK; i++) {
+		clear_pages(addr + i * PAGE_RESCHED_CHUNK * PAGE_SIZE, PAGE_RESCHED_CHUNK);
+		cond_resched();
+	}
+
+	remaining = npages % PAGE_RESCHED_CHUNK;
+
+	if (remaining)
+		clear_pages(addr + i * PAGE_RESCHED_CHUNK * PAGE_SIZE, remaining);
+out:
+	cond_resched();
+}
+
+/*
+ * folio_zero_user() - multi-page clearing.
+ *
+ * @folio: hugepage folio
+ * @addr_hint: faulting address (if any)
+ *
+ * Overrides common code folio_zero_user(). This version takes advantage of
+ * the fact that string instructions in clear_pages() are more performant
+ * on larger extents compared to the usual page-at-a-time clearing.
+ *
+ * Clearing of 2MB pages is split into three parts: pages in the immediate
+ * locality of the faulting page, and the regions to its left and right; the
+ * local neighbourhood is cleared last in order to keep cache lines of the
+ * target region hot.
+ *
+ * For GB pages, there is no expectation of cache locality so just do a
+ * straight zero.
+ *
+ * Note that the folio is fully allocated already so we don't do any exception
+ * handling.
+ */
+void folio_zero_user(struct folio *folio, unsigned long addr_hint)
+{
+	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+	const int width = 2; /* number of pages cleared last on either side */
+	struct range r[3];
+	int i;
+
+	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+		clear_pages_resched(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
+		return;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		int npages = range_len(&r[i]);
+
+		if (npages > 0)
+			clear_pages_resched(page_address(folio_page(folio, r[i].start)), npages);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
-- 
2.31.1



Thread overview: 32+ messages
2025-06-16  5:22 [PATCH v4 00/13] x86/mm: " Ankur Arora
2025-06-16  5:22 ` [PATCH v4 01/13] perf bench mem: Remove repetition around time measurement Ankur Arora
2025-06-16  5:22 ` [PATCH v4 02/13] perf bench mem: Defer type munging of size to float Ankur Arora
2025-06-16  5:22 ` [PATCH v4 03/13] perf bench mem: Move mem op parameters into a structure Ankur Arora
2025-06-16  5:22 ` [PATCH v4 04/13] perf bench mem: Pull out init/fini logic Ankur Arora
2025-06-16  5:22 ` [PATCH v4 05/13] perf bench mem: Switch from zalloc() to mmap() Ankur Arora
2025-06-16  5:22 ` [PATCH v4 06/13] perf bench mem: Allow mapping of hugepages Ankur Arora
2025-06-16  5:22 ` [PATCH v4 07/13] perf bench mem: Allow chunking on a memory region Ankur Arora
2025-06-16  5:22 ` [PATCH v4 08/13] perf bench mem: Refactor mem_options Ankur Arora
2025-06-16  5:22 ` [PATCH v4 09/13] perf bench mem: Add mmap() workloads Ankur Arora
2025-06-16  5:22 ` [PATCH v4 10/13] x86/mm: Simplify clear_page_* Ankur Arora
2025-06-16 14:35   ` Dave Hansen
2025-06-16 14:38     ` Peter Zijlstra
2025-06-16 18:18     ` Ankur Arora
2025-06-16 16:48   ` kernel test robot
2025-06-16  5:22 ` [PATCH v4 11/13] x86/clear_page: Introduce clear_pages() Ankur Arora
2025-06-16  5:22 ` [PATCH v4 12/13] mm: memory: allow arch override for folio_zero_user() Ankur Arora
2025-06-16  5:22 ` Ankur Arora [this message]
2025-06-16 11:39   ` [PATCH v4 13/13] x86/folio_zero_user: Add multi-page clearing kernel test robot
2025-06-16 14:44   ` Dave Hansen
2025-06-16 14:50     ` Peter Zijlstra
2025-06-16 15:03       ` Dave Hansen
2025-06-16 18:20       ` Ankur Arora
2025-06-16 14:58     ` Matthew Wilcox
2025-06-16 18:47     ` Ankur Arora
2025-06-19 23:51       ` Ankur Arora
2025-06-16 15:06 ` [PATCH v4 00/13] x86/mm: " Dave Hansen
2025-06-16 18:25   ` Ankur Arora
2025-06-16 18:30     ` Dave Hansen
2025-06-16 18:43       ` Ankur Arora
2025-07-04  8:15 ` Raghavendra K T
2025-07-07 21:02   ` Ankur Arora
