From: Ankur Arora <ankur.a.arora@oracle.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	david@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, mingo@redhat.com, mjguzik@gmail.com,
	luto@kernel.org, peterz@infradead.org, tglx@linutronix.de,
	willy@infradead.org, raghavendra.kt@amd.com, chleroy@kernel.org,
	ioworker0@gmail.com, boris.ostrovsky@oracle.com,
	konrad.wilk@oracle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
Date: Mon, 15 Dec 2025 22:49:25 -0800	[thread overview]
Message-ID: <874ipqexai.fsf@oracle.com> (raw)
In-Reply-To: <20251215184413.19589400a74c2aadb42a2eca@linux-foundation.org>


Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Clear contiguous page ranges in folio_zero_user() instead of clearing
>> a single page at a time. Exposing larger ranges enables extent-based
>> processor optimizations.
>>
>> However, because the underlying clearing primitives do not, or might
>> not be able to, call cond_resched() to check if preemption is
>> required, limit the worst case preemption latency by doing the
>> clearing in units of no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
>>
>> For architectures that define clear_pages(), we assume that the
>> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
>> worth of pages. This should be large enough to allow the processor
>> to optimize the operation and yet small enough that we see reasonable
>> preemption latency when this optimization is not possible
>> (e.g., slow microarchitectures, memory bandwidth saturation).
>>
>> Architectures that don't define clear_pages() will continue to use
>> the base value (single page). And preemptible models don't need
>> invocations of cond_resched(), so they don't care about the batch size.
>>
>> The resultant performance depends on the kinds of optimizations
>> available to the CPU for the region size being cleared. Two classes
>> of optimizations:
>>
>>   - clearing iteration costs are amortized over a range larger
>>     than a single page.
>>   - cacheline allocation elision (seen on AMD Zen models).
>
> 8MB is a big chunk of memory.
>
>> Testing a demand fault workload shows an improved baseline from the
>> first optimization and a larger improvement when the region being
>> cleared is large enough for the second optimization.
>>
>> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>
> So we break out of the copy to run cond_resched() 8192 times?  This sounds
> like a minor cost.
>
>>   $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>
>>                     page-at-a-time     contiguous clearing      change
>>
>>                   (GB/s  +- %stdev)     (GB/s  +- %stdev)
>>
>>    pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*
>>
>>    pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05%       +  5.2%	preempt=none|voluntary
>>    pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21% [#]   +144.3%	preempt=full|lazy
>
> And yet those 8192 cond_resched()'s have a huge impact on the
> performance!  I find this result very surprising.  Is it explainable?

I agree that this is surprising. For the 2MB extent, I still find the
~30% improvement quite high, but I think a decent portion of it comes
from:

  - on x86, the CPU is executing a single microcoded insn: REP; STOSB.
    And, because it's doing that for a 2MB extent instead of a bunch of
    4K extents, it saves on the microcoding costs (and I suspect it
    allows some range optimization, which also helps; see the sketch
    after this list.)

  - the second reason (from Ingo) was, again, the per-iteration cost,
    which, given all of the mitigations on x86, is quite substantial.

    On the AMD systems I had tested on, I think there's at least the cost
    of RET misprediction in there.

    (https://lore.kernel.org/lkml/Z_yzshvBmYiPrxU0@gmail.com/)
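
To make the REP; STOSB point concrete, here's an illustrative range
clear (not the kernel's actual implementation; clear_page_erms() is
the per-page analogue): a single instruction regardless of extent
size, so a 2MB clear avoids 511 extra rounds of microcode
setup/teardown and loop iteration:

  /*
   * Illustrative only: clear an arbitrary byte range with a single
   * REP; STOSB. AL (zero) is stored to [RDI], RCX times.
   */
  static inline void clear_range_erms(void *addr, unsigned long len)
  {
          asm volatile("rep stosb"
                       : "+D" (addr), "+c" (len)
                       : "a" (0)
                       : "memory");
  }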

>>  [#] Notice that we perform much better with preempt=full|lazy. As
>>   mentioned above, preemptible models not needing explicit invocations
>>   of cond_resched() allow clearing of the full extent (1GB) as a
>>   single unit.
>>   In comparison the maximum extent used for preempt=none|voluntary is
>>   PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>>
>>   The larger extent allows the processor to elide cacheline
>>   allocation (on Milan the threshold is LLC-size=32MB.)
>
> It is this?

Yeah, I think so. For sizes >= 32MB, the microcode can simply elide
cacheline allocation, and with foreknowledge of the extent it can
perhaps also optimize cache coherence traffic (this last part is my
speculation).
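
To make the batching concrete, the loop is roughly shaped like the
following sketch (not the actual patch; the clear_pages() signature
is approximate and PROCESS_PAGES_NON_PREEMPT_BATCH is in pages):

  /*
   * Sketch: under a preemptible model clear the whole extent in one
   * call; otherwise chop it into PROCESS_PAGES_NON_PREEMPT_BATCH
   * units with cond_resched() in between. This is why
   * preempt=full|lazy gets the full 1GB extent (crossing the 32MB
   * no-allocate threshold) while preempt=none|voluntary tops out
   * at 8MB.
   */
  unsigned long batch = preempt_model_preemptible() ?
                        npages : PROCESS_PAGES_NON_PREEMPT_BATCH;

  while (npages) {
          unsigned long n = min(npages, batch);

          clear_pages(page_address(page), n);
          cond_resched();
          page += n;
          npages -= n;
  }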

On cacheline allocation elision, compare the L1-dcache-loads counts in
the two versions below ('-' is page-at-a-time, '+' is contiguous
clearing):

pg-sz=1GB:
  -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
  -    544,878,976      instructions                     #    0.06  insn per cycle
  -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
  -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)

  +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
  +     10,979,121      instructions                     #    0.00  insn per cycle
  +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
  +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)

(From an earlier version of this series: https://lore.kernel.org/lkml/20250414034607.762653-5-ankur.a.arora@oracle.com/)

Maybe I should have kept it in this commit :).
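
For reference, the counters above can be gathered with something like
the following (assuming the perf-bench mmap workload from the commit
message; exact event names can vary by platform):

  $ perf stat -r 5 -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses \
        perf bench mem mmap -p 1GB -f demand -s 64GB -l 5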

>> Also as mentioned earlier, the baseline improvement is not specific to
>> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
>> improvement to the Milan pg-sz=2MB workload above (~30%).
>>

--
ankur


