linux-mm.kvack.org archive mirror
From: Ankur Arora <ankur.a.arora@oracle.com>
To: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	luto@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, mingo@redhat.com, juri.lelli@redhat.com,
	willy@infradead.org, mgorman@suse.de, peterz@infradead.org,
	rostedt@goodmis.org, tglx@linutronix.de,
	vincent.guittot@linaro.org, jon.grimm@amd.com, bharata@amd.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing
Date: Sat, 08 Apr 2023 15:46:56 -0700	[thread overview]
Message-ID: <87ttxqf0v3.fsf@oracle.com> (raw)
In-Reply-To: <271b85ec-281e-d33b-5495-59eb2bc9fde4@amd.com>


Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>> This series introduces multi-page clearing for hugepages.

>    *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>                            (GB/s)           (GB/s)
>   pg-sz=2MB                 12.24            17.54    +43.30%
>    pg-sz=1GB                17.98            37.24   +107.11%
>
>
> Hello Ankur,
>
> I was able to test your patches. To summarize, I am seeing a 2x-3x perf
> improvement for the 2M and 1GB base hugepage sizes.

Great. Thanks Raghavendra.

> SUT: Genoa AMD EPYC
>    Thread(s) per core:  2
>    Core(s) per socket:  128
>    Socket(s):           2
>
> NUMA:
>   NUMA node(s):          2
>   NUMA node0 CPU(s):     0-127,256-383
>   NUMA node1 CPU(s):     128-255,384-511
>
> Test: use mmap(MAP_HUGETLB) to demand-fault a 64GB region (on NUMA node0),
> for both base-hugepage-size=2M and 1GB.
>
> perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
>
> time in seconds elapsed (average of 10 runs) (lower = better)
>
> Result:
> page-size  mm/clear_huge_page   x86/clear_huge_page
> 2M              5.4567          2.6774
> 1G              2.64452         1.011281

So translating into BW, for Genoa we have:

page-size  mm/clear_huge_page   x86/clear_huge_page
                       (GB/s)                (GB/s)
 2M                     11.74                 23.97
 1G                     24.24                 63.36

That's a pretty good bump over Milan:

>    *Milan*     mm/clear_huge_page   x86/clear_huge_page
>                            (GB/s)           (GB/s)
>   pg-sz=2MB                12.24            17.54
>   pg-sz=1GB                17.98            37.24

Btw, are these numbers with boost=1?

> Full perfstat info
>
>  page size = 2M mm/clear_huge_page
>
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
>
>           5,434.71 msec task-clock                #    0.996 CPUs utilized            ( +-  0.55% )
>                  8      context-switches          #    1.466 /sec                     ( +-  4.66% )
>                  0      cpu-migrations            #    0.000 /sec
>             32,918      page-faults               #    6.034 K/sec                    ( +-  0.00% )
>     16,977,242,482      cycles                    #    3.112 GHz                      ( +-  0.04% )  (35.70%)
>          1,961,724      stalled-cycles-frontend   #    0.01% frontend cycles idle     ( +-  1.09% )  (35.72%)
>         35,685,674      stalled-cycles-backend    #    0.21% backend cycles idle      ( +-  3.48% )  (35.74%)
>      1,038,327,182      instructions              #    0.06  insn per cycle
>                                                   #    0.04  stalled cycles per insn  ( +-  0.38% )  (35.75%)
>        221,409,216      branches                  #   40.584 M/sec                    ( +-  0.36% )  (35.75%)
>            350,730      branch-misses             #    0.16% of all branches          ( +-  1.18% )  (35.75%)
>      2,520,888,779      L1-dcache-loads           #  462.077 M/sec                    ( +-  0.03% )  (35.73%)
>      1,094,178,209      L1-dcache-load-misses     #   43.46% of all L1-dcache accesses  ( +-  0.02% )  (35.71%)
>         67,751,730      L1-icache-loads           #   12.419 M/sec                    ( +-  0.11% )  (35.70%)
>            271,118      L1-icache-load-misses     #    0.40% of all L1-icache accesses  ( +-  2.55% )  (35.70%)
>            506,635      dTLB-loads                #   92.866 K/sec                    ( +-  3.31% )  (35.70%)
>            237,385      dTLB-load-misses          #   43.64% of all dTLB cache accesses  ( +-  7.00% )  (35.69%)
>                268      iTLB-load-misses          # 6700.00% of all iTLB cache accesses  ( +- 13.86% )  (35.70%)
>
>             5.4567 +- 0.0300 seconds time elapsed  ( +-  0.55% )
>
>  page size = 2M x86/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):
>
>           2,780.69 msec task-clock                #    1.039 CPUs utilized            ( +-  1.03% )
>                  3      context-switches          #    1.121 /sec                     ( +- 21.34% )
>                  0      cpu-migrations            #    0.000 /sec
>             32,918      page-faults               #   12.301 K/sec                    ( +-  0.00% )
>      8,143,619,771      cycles                    #    3.043 GHz                      ( +-  0.25% )  (35.62%)
>          2,024,872      stalled-cycles-frontend   #    0.02% frontend cycles idle     ( +-320.93% )  (35.66%)
>        717,198,728      stalled-cycles-backend    #    8.82% backend cycles idle      ( +-  8.26% )  (35.69%)
>        606,549,334      instructions              #    0.07  insn per cycle
>                                                   #    1.39  stalled cycles per insn  ( +-  0.23% )  (35.73%)
>        108,856,550      branches                  #   40.677 M/sec                    ( +-  0.24% )  (35.76%)
>            202,490      branch-misses             #    0.18% of all branches          ( +-  3.58% )  (35.78%)
>      2,348,818,806      L1-dcache-loads           #  877.701 M/sec                    ( +-  0.03% )  (35.78%)
>      1,081,562,988      L1-dcache-load-misses     #   46.04% of all L1-dcache accesses  ( +-  0.01% )  (35.78%)
>    <not supported>      LLC-loads
>    <not supported>      LLC-load-misses
>         43,411,167      L1-icache-loads           #   16.222 M/sec                    ( +-  0.19% )  (35.77%)
>            273,042      L1-icache-load-misses     #    0.64% of all L1-icache accesses  ( +-  4.94% )  (35.76%)
>            834,482      dTLB-loads                #  311.827 K/sec                    ( +-  9.73% )  (35.72%)
>            437,343      dTLB-load-misses          #   65.86% of all dTLB cache accesses  ( +-  8.56% )  (35.68%)
>                  0      iTLB-loads                #    0.000 /sec                     (35.65%)
>                160      iTLB-load-misses          # 1777.78% of all iTLB cache accesses  ( +- 15.82% )  (35.62%)
>
>             2.6774 +- 0.0287 seconds time elapsed  ( +-  1.07% )
>
>  page size = 1G mm/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           2,625.24 msec task-clock                #    0.993 CPUs utilized            ( +-  0.23% )
>                  4      context-switches          #    1.513 /sec                     ( +-  4.49% )
>                  1      cpu-migrations            #    0.378 /sec
>                214      page-faults               #   80.965 /sec                     ( +-  0.13% )
>      8,178,624,349      cycles                    #    3.094 GHz                      ( +-  0.23% )  (35.65%)
>          2,942,576      stalled-cycles-frontend   #    0.04% frontend cycles idle     ( +- 75.22% )  (35.69%)
>          7,117,425      stalled-cycles-backend    #    0.09% backend cycles idle      ( +-  3.79% )  (35.73%)
>        454,521,647      instructions              #    0.06  insn per cycle
>                                                   #    0.02  stalled cycles per insn  ( +-  0.10% )  (35.77%)
>        113,223,853      branches                  #   42.837 M/sec                    ( +-  0.08% )  (35.80%)
>             84,766      branch-misses             #    0.07% of all branches          ( +-  5.37% )  (35.80%)
>      2,294,528,890      L1-dcache-loads           #  868.111 M/sec                    ( +-  0.02% )  (35.81%)
>      1,075,907,551      L1-dcache-load-misses     #   46.88% of all L1-dcache accesses  ( +-  0.02% )  (35.78%)
>         26,167,323      L1-icache-loads           #    9.900 M/sec                    ( +-  0.24% )  (35.74%)
>            139,675      L1-icache-load-misses     #    0.54% of all L1-icache accesses  ( +-  0.37% )  (35.70%)
>              3,459      dTLB-loads                #    1.309 K/sec                    ( +- 12.75% )  (35.67%)
>                732      dTLB-load-misses          #   19.71% of all dTLB cache accesses  ( +- 26.61% )  (35.62%)
>                 11      iTLB-load-misses          #  192.98% of all iTLB cache accesses  ( +-238.28% )  (35.62%)
>
>            2.64452 +- 0.00600 seconds time elapsed  ( +-  0.23% )
>
>
>  page size = 1G x86/clear_huge_page
>  Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           1,009.09 msec task-clock                #    0.998 CPUs utilized            ( +-  0.06% )
>                  2      context-switches          #    1.980 /sec                     ( +- 23.63% )
>                  1      cpu-migrations            #    0.990 /sec
>                214      page-faults               #  211.887 /sec                     ( +-  0.16% )
>      3,154,980,463      cycles                    #    3.124 GHz                      ( +-  0.06% )  (35.77%)
>            145,051      stalled-cycles-frontend   #    0.00% frontend cycles idle     ( +-  6.26% )  (35.78%)
>        730,087,143      stalled-cycles-backend    #   23.12% backend cycles idle      ( +-  9.75% )  (35.78%)
>         45,813,391      instructions              #    0.01  insn per cycle
>                                                   #   18.51  stalled cycles per insn  ( +-  1.00% )  (35.78%)
>          8,498,282      branches                  #    8.414 M/sec                    ( +-  1.54% )  (35.78%)
>             63,351      branch-misses             #    0.74% of all branches          ( +-  6.70% )  (35.69%)
>         29,135,863      L1-dcache-loads           #   28.848 M/sec                    ( +-  5.67% )  (35.68%)
>          8,537,280      L1-dcache-load-misses     #   28.66% of all L1-dcache accesses  ( +- 10.15% )  (35.68%)
>          1,040,087      L1-icache-loads           #    1.030 M/sec                    ( +-  1.60% )  (35.68%)
>              9,147      L1-icache-load-misses     #    0.85% of all L1-icache accesses  ( +-  6.50% )  (35.67%)
>              1,084      dTLB-loads                #    1.073 K/sec                    ( +- 12.05% )  (35.68%)
>                431      dTLB-load-misses          #   40.28% of all dTLB cache accesses  ( +- 43.46% )  (35.68%)
>                 16      iTLB-load-misses          #    0.00% of all iTLB cache accesses  ( +- 40.54% )  (35.68%)
>
>           1.011281 +- 0.000624 seconds time elapsed  ( +-  0.06% )
>
> Please feel free to add
>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

Thanks

Ankur

> Will come back with further observations on patch/performance, if any.


Thread overview: 30+ messages
2023-04-03  5:22 Ankur Arora
2023-04-03  5:22 ` [PATCH 1/9] huge_pages: get rid of process_huge_page() Ankur Arora
2023-04-03  5:22 ` [PATCH 2/9] huge_page: get rid of {clear,copy}_subpage() Ankur Arora
2023-04-03  5:22 ` [PATCH 3/9] huge_page: allow arch override for clear/copy_huge_page() Ankur Arora
2023-04-03  5:22 ` [PATCH 4/9] x86/clear_page: parameterize clear_page*() to specify length Ankur Arora
2023-04-06  8:19   ` Peter Zijlstra
2023-04-07  3:03     ` Ankur Arora
2023-04-03  5:22 ` [PATCH 5/9] x86/clear_pages: add clear_pages() Ankur Arora
2023-04-06  8:23   ` Peter Zijlstra
2023-04-07  0:50     ` Ankur Arora
2023-04-07 10:34       ` Peter Zijlstra
2023-04-09 13:26         ` Matthew Wilcox
2023-04-03  5:22 ` [PATCH 6/9] mm/clear_huge_page: use multi-page clearing Ankur Arora
2023-04-03  5:22 ` [PATCH 7/9] sched: define TIF_ALLOW_RESCHED Ankur Arora
2023-04-05 20:07   ` Peter Zijlstra
2023-04-03  5:22 ` [PATCH 8/9] irqentry: define irqentry_exit_allow_resched() Ankur Arora
2023-04-04  9:38   ` Thomas Gleixner
2023-04-05  5:29     ` Ankur Arora
2023-04-05 20:22   ` Peter Zijlstra
2023-04-06 16:56     ` Ankur Arora
2023-04-06 20:13       ` Peter Zijlstra
2023-04-06 20:16         ` Peter Zijlstra
2023-04-07  2:29         ` Ankur Arora
2023-04-07 10:23           ` Peter Zijlstra
2023-04-03  5:22 ` [PATCH 9/9] x86/clear_huge_page: make clear_contig_region() preemptible Ankur Arora
2023-04-05 20:27   ` Peter Zijlstra
2023-04-06 17:00     ` Ankur Arora
2023-04-05 19:48 ` [PATCH 0/9] x86/clear_huge_page: multi-page clearing Raghavendra K T
2023-04-08 22:46   ` Ankur Arora [this message]
2023-04-10  6:26     ` Raghavendra K T
