From: Ivan Babrou <ivan@cloudflare.com>
To: Shakeel Butt <shakeelb@google.com>
Cc: "Daniel Dao" <dqminh@cloudflare.com>,
kernel-team <kernel-team@cloudflare.com>,
"Linux MM" <linux-mm@kvack.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Roman Gushchin" <guro@fb.com>, "Feng Tang" <feng.tang@intel.com>,
"Michal Hocko" <mhocko@kernel.org>,
"Hillf Danton" <hdanton@sina.com>,
"Michal Koutný" <mkoutny@suse.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Linus Torvalds" <torvalds@linux-foundation.org>
Subject: Re: Regression in workingset_refault latency on 5.15
Date: Wed, 23 Feb 2022 11:28:39 -0800
Message-ID: <CABWYdi0yzaKdOC2r6tgWODmQtHT_8NgaHQhrFEo_to0BWDKP2A@mail.gmail.com>
In-Reply-To: <CALvZod5g1o9f2qZcb-JkVAqdAmfN-TYyO_3KaK2iZh=4hmO56Q@mail.gmail.com>
> Can you please first confirm if you are running 5.15.19 kernel? If
> that is the case then it means the machine is seeing (nr_cpus * 32)
> stat updates.
Yes, we're on 5.15.19.
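For scale: with, say, 128 CPUs (an illustrative count, not necessarily
our exact configuration), that threshold works out to 128 * 32 = 4096
batched stat updates before the next flush becomes due.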
> Couple of questions:
> 1) Did you see any call stack other than xfs read where expensive
> mem_cgroup_flush_stats() is showing up?
Yes:
mem_cgroup_flush_stats: 7717 us
stack:
mem_cgroup_flush_stats+1
workingset_refault+296
__read_swap_cache_async+512
swapin_readahead+817
do_swap_page+561
__handle_mm_fault+2318
handle_mm_fault+191
do_user_addr_fault+428
exc_page_fault+103
asm_exc_page_fault+30
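For reference, a number like the one above can be collected by timing the
function between a kprobe and a kretprobe; a minimal sketch of just the
timing part (the exact script differs and also records the calling stack):
bpftrace -e '
kprobe:mem_cgroup_flush_stats {
  @start[tid] = nsecs;                             // timestamp on entry
}
kretprobe:mem_cgroup_flush_stats /@start[tid]/ {
  // per-call latency in microseconds
  @lat_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'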
> 2) Can you please use the similar bpf+kprobe tracing for the
> memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> source of frequent stat updates.
"memcg_rstat_updated" is "static inline".
With the following:
bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'
The top stacks are:
@stacks[
__mod_memcg_lruvec_state+1
__mod_lruvec_page_state+96
clear_page_dirty_for_io+292
write_cache_pages+323
iomap_writepages+28
xfs_vm_writepages+113
do_writepages+195
__writeback_single_inode+55
writeback_sb_inodes+493
__writeback_inodes_wb+76
]: 36429
@stacks[
__mod_memcg_lruvec_state+1
__mod_lruvec_page_state+96
__test_set_page_writeback+421
iomap_do_writepage+1241
write_cache_pages+393
iomap_writepages+28
xfs_vm_writepages+113
do_writepages+195
__writeback_single_inode+55
writeback_sb_inodes+493
]: 36431
@stacks[
__mod_memcg_lruvec_state+1
mod_objcg_state+421
memcg_slab_post_alloc_hook+345
__kmalloc_node_track_caller+386
__alloc_skb+139
alloc_skb_with_frags+75
sock_alloc_send_pskb+495
unix_stream_sendmsg+599
sock_sendmsg+94
sock_write_iter+151
]: 39459
@stacks[
__mod_memcg_lruvec_state+1
release_pages+760
tlb_finish_mmu+89
zap_page_range+265
do_madvise.part.0+1844
__x64_sys_madvise+69
do_syscall_64+59
entry_SYSCALL_64_after_hwframe+68
]: 54648
@stacks[
__mod_memcg_lruvec_state+1
__mod_lruvec_page_state+96
page_remove_rmap+85
unmap_page_range+1413
zap_page_range+224
do_madvise.part.0+1844
__x64_sys_madvise+69
do_syscall_64+59
entry_SYSCALL_64_after_hwframe+68
]: 54719
@stacks[
__mod_memcg_lruvec_state+1
__pagevec_lru_add+256
lru_cache_add+82
__handle_mm_fault+4686
handle_mm_fault+191
do_user_addr_fault+428
exc_page_fault+103
asm_exc_page_fault+30
]: 98103
@stacks[
__mod_memcg_lruvec_state+1
__mod_lruvec_page_state+96
page_add_new_anon_rmap+78
__handle_mm_fault+4675
handle_mm_fault+191
do_user_addr_fault+428
exc_page_fault+103
asm_exc_page_fault+30
]: 105071
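To narrow down the source of the frequent updates further, the same probe
can be broken down by process and by stat index; a rough sketch (assuming
the 5.15 signature __mod_memcg_lruvec_state(lruvec, idx, val), so arg1 is
the node_stat_item index):
bpftrace -e '
kprobe:__mod_memcg_lruvec_state {
  @by_comm[comm] = count();   // which tasks drive the updates
  @by_idx[arg1] = count();    // which node_stat_item is being updated
}
interval:s:10 { exit(); }     // stop and dump the counts after 10s'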
> 3) I am still pondering why disabling swap resolves the issue for you.
> Is that only for a workload different from xfs read?
My understanding is that any block IO (including swap) triggers the new
memcg accounting code. The affected process does no IO other than swap,
so disabling swap removes the major (if not the only) vector for
triggering this issue.
> 4) Can you please also trace cgroup_rstat_flush_locked()?
Stacks are:
bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @stacks[kstack(10)]++ }'
Attaching 1 probe...
^C
@stacks[
cgroup_rstat_flush_locked+1
cgroup_rstat_flush_irqsafe+36
__mem_cgroup_flush_stats+51
shrink_node+163
balance_pgdat+777
kswapd+508
kthread+295
ret_from_fork+34
]: 1
@stacks[
cgroup_rstat_flush_locked+1
cgroup_rstat_flush_irqsafe+36
__mem_cgroup_flush_stats+51
workingset_refault+296
__read_swap_cache_async+512
swapin_readahead+817
do_swap_page+561
__handle_mm_fault+2318
handle_mm_fault+191
do_user_addr_fault+428
]: 5
@stacks[
cgroup_rstat_flush_locked+1
cgroup_rstat_flush_irqsafe+36
__mem_cgroup_flush_stats+51
workingset_refault+296
add_to_page_cache_lru+159
page_cache_ra_unbounded+340
filemap_get_pages+569
filemap_read+164
xfs_file_buffered_read+152
xfs_file_read_iter+106
]: 18
@stacks[
cgroup_rstat_flush_locked+1
cgroup_rstat_flush_irqsafe+36
__mem_cgroup_flush_stats+51
workingset_refault+296
add_to_page_cache_lru+159
page_cache_ra_unbounded+340
filemap_get_pages+233
filemap_read+164
xfs_file_buffered_read+152
xfs_file_read_iter+106
]: 31
@stacks[
cgroup_rstat_flush_locked+1
cgroup_rstat_flush_irqsafe+36
__mem_cgroup_flush_stats+51
workingset_refault+296
add_to_page_cache_lru+159
page_cache_ra_unbounded+340
force_page_cache_ra+226
filemap_get_pages+233
filemap_read+164
xfs_file_buffered_read+152
]: 223
Let me know if you only need the slow ones.
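If only the slow flushes are interesting, they are easy to isolate with a
duration threshold; a minimal sketch along these lines (the 1 ms cutoff
is arbitrary):
bpftrace -e '
kprobe:cgroup_rstat_flush_locked {
  @start[tid] = nsecs;
}
kretprobe:cgroup_rstat_flush_locked /@start[tid]/ {
  $us = (nsecs - @start[tid]) / 1000;
  if ($us > 1000) {           // report only flushes slower than 1ms
    printf("%s (pid %d): cgroup_rstat_flush_locked took %d us\n", comm, pid, $us);
  }
  delete(@start[tid]);
}'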
I should mention that there are really two issues:
1. Expensive workingset_refault that shows up on flamegraphs. We see
it for our rocksdb-based database, which persists data on xfs (local
nvme).
2. Expensive workingset_refault that causes latency hiccups but
doesn't show up on flamegraphs. We see it in our nginx-based proxy
with swap enabled (either zram or a regular swap file on xfs on local
nvme).
We solved the latter by disabling swap. I think the proper solution
would be for workingset_refault to be fast enough to be invisible, in
line with how it behaved on Linux 5.10.