linux-mm.kvack.org archive mirror
From: Raghavendra K T <raghavendra.kt@amd.com>
To: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Cc: torvalds@linux-foundation.org, akpm@linux-foundation.org,
	bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, luto@kernel.org, peterz@infradead.org,
	paulmck@kernel.org, rostedt@goodmis.org, tglx@linutronix.de,
	willy@infradead.org, jon.grimm@amd.com, bharata@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
Date: Tue, 22 Apr 2025 11:53:06 +0530
Message-ID: <0d6ba41c-0c90-4130-896a-26eabbd5bd24@amd.com>
In-Reply-To: <20250414034607.762653-1-ankur.a.arora@oracle.com>



On 4/14/2025 9:16 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
> 
> Why multi-page clearing?: multi-page clearing improves upon the
> current page-at-a-time approach by providing the processor with a
> hint as to the real region size. A processor could use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
> 
> This optimization in particular is performed by REP; STOS on AMD Zen,
> where regions larger than the L3 size use non-temporal stores.
> 
> This results in significantly better performance.
> 
> We also see a performance improvement in cases where this optimization is
> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel), because
> REP; STOS is typically microcoded: its setup cost can now be amortized
> over larger regions, and the hint allows the hardware prefetcher to do a
> better job.
> 
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> 
>                   mm/folio_zero_user    x86/folio_zero_user     change
>                    (GB/s  +- stddev)      (GB/s  +- stddev)
> 
>    pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>    pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
> 
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> 
>                   mm/folio_zero_user    x86/folio_zero_user     change
>                    (GB/s +- stddev)      (GB/s +- stddev)
> 
>    pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>    pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
> 
[...]
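The difference between the two approaches can be sketched in userspace C. This is an illustrative model only: `clear_page_at_a_time` and `clear_pages_multi` are hypothetical names, and `memset` stands in for the kernel's `clear_page()`/REP; STOS paths; the point is what extent information the processor gets to see.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* Page-at-a-time: the CPU only ever sees one PAGE_SIZE region per
 * call, so it cannot tell that a much larger region is being zeroed. */
static void clear_page_at_a_time(void *addr, size_t npages)
{
	size_t i;

	for (i = 0; i < npages; i++)
		memset((char *)addr + i * PAGE_SIZE, 0, PAGE_SIZE);
}

/* Multi-page clearing: a single call exposes the full extent, so an
 * implementation backed by REP; STOS can, e.g., switch to non-temporal
 * stores when the region exceeds the cache size. */
static void clear_pages_multi(void *addr, size_t npages)
{
	memset(addr, 0, npages * PAGE_SIZE);
}
```

Both produce the same zeroed memory; only the size hint given to the hardware differs.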

Hello Ankur,

Thank you for the patches. I was able to test briefly with lazy preempt
mode.

(I do understand that there could be a lot of churn based on Ingo's,
Mateusz's, and others' comments.)
But here it goes:

SUT: AMD EPYC 9B24 (Genoa) preempt=lazy

metric = time taken in seconds (lower is better); total SIZE = 64GB

                  mm/folio_zero_user    x86/folio_zero_user    change (%)
   pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%     57.06
   pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%     50.57
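As a quick sanity check, the "change" column matches the percentage reduction in elapsed time relative to the mm/folio_zero_user baseline, using the mean times from the table above:

```python
# Percentage reduction in elapsed time (lower is better).
def pct_reduction(base, new):
    return (base - new) / base * 100.0

print(round(pct_reduction(2.47044, 1.060877), 2))   # pg-sz=1GB -> 57.06
print(round(pct_reduction(5.098403, 2.52015), 2))   # pg-sz=2MB -> 50.57
```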

More details (1G example run):

base kernel    =   6.14 (preempt = lazy)

mm/folio_zero_user
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

           2,476.47 msec task-clock                       #    1.002 CPUs utilized            ( +-  0.39% )
                  5      context-switches                 #    2.025 /sec                     ( +- 29.70% )
                  2      cpu-migrations                   #    0.810 /sec                     ( +- 21.15% )
                202      page-faults                      #   81.806 /sec                     ( +-  0.18% )
      7,348,664,233      cycles                           #    2.976 GHz                      ( +-  0.38% )  (38.39%)
        878,805,326      stalled-cycles-frontend          #   11.99% frontend cycles idle     ( +-  0.74% )  (38.43%)
        339,023,729      instructions                     #    0.05  insn per cycle
                                                          #    2.53  stalled cycles per insn  ( +-  0.08% )  (38.47%)
         88,579,915      branches                         #   35.873 M/sec                    ( +-  0.06% )  (38.51%)
         17,369,776      branch-misses                    #   19.55% of all branches          ( +-  0.04% )  (38.55%)
      2,261,339,695      L1-dcache-loads                  #  915.795 M/sec                    ( +-  0.06% )  (38.56%)
      1,073,880,164      L1-dcache-load-misses            #   47.48% of all L1-dcache accesses  ( +-  0.05% )  (38.56%)
        511,231,988      L1-icache-loads                  #  207.038 M/sec                    ( +-  0.25% )  (38.52%)
            128,533      L1-icache-load-misses            #    0.02% of all L1-icache accesses  ( +-  0.40% )  (38.48%)
             38,134      dTLB-loads                       #   15.443 K/sec                    ( +-  4.22% )  (38.44%)
             33,992      dTLB-load-misses                 #  114.39% of all dTLB cache accesses  ( +-  9.42% )  (38.40%)
                156      iTLB-loads                       #   63.177 /sec                     ( +- 13.34% )  (38.36%)
                156      iTLB-load-misses                 #  102.50% of all iTLB cache accesses  ( +- 25.98% )  (38.36%)

            2.47044 +- 0.00949 seconds time elapsed  ( +-  0.38% )

x86/folio_zero_user
           1,056.72 msec task-clock                       #    0.996 CPUs utilized            ( +-  0.07% )
                 10      context-switches                 #    9.436 /sec                     ( +-  3.59% )
                  3      cpu-migrations                   #    2.831 /sec                     ( +- 11.33% )
                200      page-faults                      #  188.718 /sec                     ( +-  0.15% )
      3,146,571,264      cycles                           #    2.969 GHz                      ( +-  0.07% )  (38.35%)
         17,226,261      stalled-cycles-frontend          #    0.55% frontend cycles idle     ( +-  4.12% )  (38.44%)
         14,130,553      instructions                     #    0.00  insn per cycle
                                                          #    1.39  stalled cycles per insn  ( +-  1.59% )  (38.53%)
          3,578,614      branches                         #    3.377 M/sec                    ( +-  1.54% )  (38.62%)
            415,807      branch-misses                    #   12.45% of all branches          ( +-  1.17% )  (38.62%)
         22,208,699      L1-dcache-loads                  #   20.956 M/sec                    ( +-  5.27% )  (38.60%)
          7,312,684      L1-dcache-load-misses            #   27.79% of all L1-dcache accesses  ( +-  8.46% )  (38.51%)
          4,032,315      L1-icache-loads                  #    3.805 M/sec                    ( +-  1.29% )  (38.48%)
             15,094      L1-icache-load-misses            #    0.38% of all L1-icache accesses  ( +-  1.14% )  (38.39%)
             14,365      dTLB-loads                       #   13.555 K/sec                    ( +-  7.23% )  (38.38%)
              9,477      dTLB-load-misses                 #   65.36% of all dTLB cache accesses  ( +- 12.05% )  (38.38%)
                 18      iTLB-loads                       #   16.985 /sec                     ( +- 34.84% )  (38.38%)
                 67      iTLB-load-misses                 #  158.39% of all iTLB cache accesses  ( +- 48.32% )  (38.32%)

           1.060877 +- 0.000766 seconds time elapsed  ( +-  0.07% )
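For comparison with the GB/s numbers in the cover letter, the elapsed times above translate into effective zeroing bandwidth, assuming the full SIZE=64GB noted earlier is what gets cleared each run:

```python
SIZE_GB = 64  # total size cleared per run, per the note above

for name, secs in [("mm/folio_zero_user ", 2.47044),
                   ("x86/folio_zero_user", 1.060877)]:
    print(f"{name}: {SIZE_GB / secs:.1f} GB/s")
# mm/folio_zero_user :  25.9 GB/s
# x86/folio_zero_user: 60.3 GB/s
```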

Thanks and Regards
- Raghu



Thread overview: 38+ messages
2025-04-14  3:46 Ankur Arora
2025-04-14  3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
2025-04-14  6:32   ` Ingo Molnar
2025-04-14 11:02     ` Peter Zijlstra
2025-04-14 11:14       ` Ingo Molnar
2025-04-14 19:46       ` Ankur Arora
2025-04-14 22:26       ` Mateusz Guzik
2025-04-15  6:14         ` Ankur Arora
2025-04-15  8:22           ` Mateusz Guzik
2025-04-15 20:01             ` Ankur Arora
2025-04-15 20:32               ` Mateusz Guzik
2025-04-14 19:52     ` Ankur Arora
2025-04-14 20:09       ` Matthew Wilcox
2025-04-15 21:59         ` Ankur Arora
2025-04-14  3:46 ` [PATCH v3 2/4] x86/clear_page: add clear_pages() Ankur Arora
2025-04-14  3:46 ` [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user() Ankur Arora
2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
2025-04-14  6:53   ` Ingo Molnar
2025-04-14 21:21     ` Ankur Arora
2025-04-14  7:05   ` Ingo Molnar
2025-04-15  6:36     ` Ankur Arora
2025-04-22  6:36     ` Raghavendra K T
2025-04-22 19:14       ` Ankur Arora
2025-04-15 10:16   ` Mateusz Guzik
2025-04-15 21:46     ` Ankur Arora
2025-04-15 22:01       ` Mateusz Guzik
2025-04-16  4:46         ` Ankur Arora
2025-04-17 14:06           ` Mateusz Guzik
2025-04-14  5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
2025-04-14 19:30   ` Ankur Arora
2025-04-14  6:36 ` Ingo Molnar
2025-04-14 19:19   ` Ankur Arora
2025-04-15 19:10 ` Zi Yan
2025-04-22 19:32   ` Ankur Arora
2025-04-22  6:23 ` Raghavendra K T [this message]
2025-04-22 19:22   ` Ankur Arora
2025-04-23  8:12     ` Raghavendra K T
2025-04-23  9:18       ` Raghavendra K T
