From: Shakeel Butt <shakeelb@google.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: kernel test robot <oliver.sang@intel.com>,
oe-lkp@lists.linux.dev, lkp@intel.com,
linux-kernel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Marek Szyprowski <m.szyprowski@samsung.com>,
linux-mm@kvack.org, linux-trace-kernel@vger.kernel.org,
ying.huang@intel.com, feng.tang@intel.com,
zhengjun.xing@linux.intel.com, fengwei.yin@intel.com
Subject: Re: [linus:master] [mm] f1a7941243: unixbench.score -19.2% regression
Date: Tue, 31 Jan 2023 05:23:52 +0000 [thread overview]
Message-ID: <20230131052352.5qnqegzwmt7akk7t@google.com> (raw)
In-Reply-To: <Y9dETROtv9Bld9TI@casper.infradead.org>
On Mon, Jan 30, 2023 at 04:15:09AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 10:32:56AM +0800, kernel test robot wrote:
> > FYI, we noticed a -19.2% regression of unixbench.score due to commit:
> >
> > commit: f1a7941243c102a44e8847e3b94ff4ff3ec56f25 ("mm: convert mm's rss stats into percpu_counter")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > in testcase: unixbench
> > on test machine: 128 threads 4 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
> > with following parameters:
> >
> > runtime: 300s
> > nr_task: 30%
> > test: spawn
> > cpufreq_governor: performance
>
> ...
>
> > 9cd6ffa60256e931 f1a7941243c102a44e8847e3b94
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 11110 -19.2% 8974 unixbench.score
> > 1090843 -12.2% 957314 unixbench.time.involuntary_context_switches
> > 4243909 ± 6% -32.4% 2867136 ± 5% unixbench.time.major_page_faults
> > 10547 -12.6% 9216 unixbench.time.maximum_resident_set_size
> > 9.913e+08 -19.6% 7.969e+08 unixbench.time.minor_page_faults
> > 5638 +19.1% 6714 unixbench.time.system_time
> > 5502 -20.7% 4363 unixbench.time.user_time
>
> So we're spending a lot more time in the kernel and correspondingly less
> time in userspace.
>
> > 67991885 -16.9% 56507507 unixbench.time.voluntary_context_switches
> > 46198768 -19.1% 37355723 unixbench.workload
> > 1.365e+08 -12.5% 1.195e+08 ± 7% cpuidle..usage
> > 1220612 ± 4% -38.0% 757009 ± 28% meminfo.Active
> > 1220354 ± 4% -38.0% 756754 ± 28% meminfo.Active(anon)
> > 0.50 ± 2% -0.1 0.45 ± 4% mpstat.cpu.all.soft%
> > 1.73 -0.2 1.52 ± 2% mpstat.cpu.all.usr%
> > 532266 -18.4% 434559 vmstat.system.cs
> > 495826 -12.2% 435455 ± 8% vmstat.system.in
> > 1.36e+08 -13.2% 1.18e+08 ± 9% turbostat.C1
> > 68.80 +0.8 69.60 turbostat.C1%
> > 1.663e+08 -12.1% 1.462e+08 ± 8% turbostat.IRQ
> > 15.54 ± 20% -49.0% 7.93 ± 24% sched_debug.cfs_rq:/.runnable_avg.min
> > 13.26 ± 19% -46.6% 7.08 ± 29% sched_debug.cfs_rq:/.util_avg.min
> > 48.96 ± 8% +51.5% 74.20 ± 13% sched_debug.cfs_rq:/.util_est_enqueued.avg
> > 138.00 ± 5% +28.9% 177.87 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
> > 228060 ± 3% +13.3% 258413 ± 4% sched_debug.cpu.avg_idle.stddev
> > 432533 ± 5% -16.4% 361517 ± 4% sched_debug.cpu.nr_switches.min
> > 2.665e+08 -18.9% 2.162e+08 numa-numastat.node0.local_node
> > 2.666e+08 -18.9% 2.163e+08 numa-numastat.node0.numa_hit
> > 2.746e+08 -20.9% 2.172e+08 numa-numastat.node1.local_node
> > 2.747e+08 -20.9% 2.172e+08 numa-numastat.node1.numa_hit
> > 2.602e+08 -17.4% 2.149e+08 numa-numastat.node2.local_node
> > 2.603e+08 -17.4% 2.149e+08 numa-numastat.node2.numa_hit
> > 2.423e+08 -15.0% 2.06e+08 numa-numastat.node3.local_node
> > 2.424e+08 -15.0% 2.061e+08 numa-numastat.node3.numa_hit
>
> So we're going off-node a lot more for ... something.
>
> > 2.666e+08 -18.9% 2.163e+08 numa-vmstat.node0.numa_hit
> > 2.665e+08 -18.9% 2.162e+08 numa-vmstat.node0.numa_local
> > 2.747e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_hit
> > 2.746e+08 -20.9% 2.172e+08 numa-vmstat.node1.numa_local
> > 2.603e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_hit
> > 2.602e+08 -17.4% 2.149e+08 numa-vmstat.node2.numa_local
> > 2.424e+08 -15.0% 2.061e+08 numa-vmstat.node3.numa_hit
> > 2.423e+08 -15.0% 2.06e+08 numa-vmstat.node3.numa_local
> > 304947 ± 4% -38.0% 189144 ± 28% proc-vmstat.nr_active_anon
>
> Umm. Are we running vmstat a lot during this test? The commit says:
>
> At the
> moment the readers are either procfs interface, oom_killer and memory
> reclaim which I think are not performance critical and should be ok with
> slow read. However I think we can make that change in a separate patch.
>
> This would explain the increased cross-NUMA references (we're going to
> the other nodes to collect the stats), and the general slowdown. But I
> don't think it reflects a real workload; it's reflecting that the
> monitoring of this workload that we're doing is now more accurate and
> more expensive.
>
Thanks Willy for taking a stab at this issue. The numa_hit stat is
updated on allocations, so I don't think stat collection would increase
these stats. I looked at the "spawn" workload in UnixBench: it is a
simple fork ping-pong, i.e. the process forks and then waits for the
child, while the child simply exits.
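
For reference, the inner loop of spawn boils down to something like
the following (my paraphrase of the benchmark, not the verbatim
UnixBench source):

  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          for (;;) {                      /* UnixBench counts iterations */
                  pid_t pid = fork();

                  if (pid < 0)
                          exit(1);
                  if (pid == 0)
                          _exit(0);       /* child: exit immediately */
                  waitpid(pid, NULL, 0);  /* parent: reap and repeat */
          }
  }

So each iteration is one mm_struct setup and teardown, which is why
any per-fork allocation cost shows up directly in the score.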
I ran perf, and it seems the percpu counter allocation is the
additional cost with this patch; see the report below. However, when I
made spawn a bit more sophisticated by adding an mmap() of a GiB, the
page table copy became the dominant cost, and there was no difference
with or without the given patch.
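
Concretely, the tweak was along these lines, added before the fork
loop in the sketch above (MAP_POPULATE here is my shorthand for
faulting the pages in up front; touching each page by hand works the
same):

  #include <sys/mman.h>

  #define SZ (1UL << 30)  /* 1 GiB */

          /* Fault in 1 GiB of anon memory so that fork's
           * copy_page_range() has real page tables to copy. */
          char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                           -1, 0);
          if (buf == MAP_FAILED)
                  exit(1);

With ~256k PTEs to copy per fork, the handful of percpu counter
allocations becomes noise by comparison.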
I am now wondering whether this fork ping-pong is really an important
enough workload to justify reverting the patch, or whether we should
ignore the regression for now and instead work on improving the
performance of the __alloc_percpu_gfp code.
- 90.97% 0.06% spawn [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
- 90.91% entry_SYSCALL_64_after_hwframe
- 90.86% do_syscall_64
- 80.03% __x64_sys_clone
- 79.98% kernel_clone
- 75.97% copy_process
+ 46.04% perf_event_init_task
- 21.50% copy_mm
- 10.05% mm_init
----------------------> - 8.92% __percpu_counter_init
- 8.67% __alloc_percpu_gfp
- 5.70% pcpu_alloc
1.29% _find_next_bit
2.57% memset_erms
+ 0.96% pgd_alloc
+ 6.16% copy_page_range
+ 1.72% anon_vma_fork
+ 0.87% mas_store
0.72% kmem_cache_alloc
+ 2.71% dup_task_struct
+ 1.37% perf_event_fork
0.63% alloc_pid
0.51% copy_files
+ 3.71% wake_up_new_task
+ 7.40% __x64_sys_exit_group
+ 2.32% __x64_sys_wait4
+ 1.03% syscall_exit_to_user_mode
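
The __percpu_counter_init cost highlighted above comes from mm_init()
now setting up one percpu counter per rss stat, so every fork pays
NR_MM_COUNTERS trips into the percpu allocator. Paraphrasing the
patch (the loop sits inline in mm_init(), with proper unwinding on
failure):

          for (i = 0; i < NR_MM_COUNTERS; i++)
                  if (percpu_counter_init(&mm->rss_stat[i], 0,
                                          GFP_KERNEL_ACCOUNT))
                          goto fail_pcpu;

Each percpu_counter_init() ends up in __alloc_percpu_gfp(), which is
the pcpu_alloc and memset_erms cost visible in the profile.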