From: Michal Hocko <mhocko@kernel.org>
To: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Honglei Wang <honglei.wang@oracle.com>,
Johannes Weiner <hannes@cmpxchg.org>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Dave Hansen <dave.hansen@intel.com>
Subject: Re: memcgroup lruvec_lru_size scaling issue
Date: Mon, 14 Oct 2019 19:37:23 +0200
Message-ID: <20191014173723.GM317@dhcp22.suse.cz>
In-Reply-To: <a64eecf1-81d4-371f-ff6d-1cb057bd091c@linux.intel.com>
On Mon 14-10-19 10:17:45, Tim Chen wrote:
> We were running a database benchmark in a mem cgroup and found that
> lruvec_lru_size is taking up a huge chunk of CPU cycles (about 25% of our
> kernel time) on the 5.3 kernel.
>
> The main issue is the loop in lruvec_page_state_local called by lruvec_lru_size
> in the mem cgroup path:
>
> for_each_possible_cpu(cpu)
>         x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
>
> Looping through all the CPUs to get the LRU size info is costly, and doing
> this on our workload with 96 CPU threads and 500 mem cgroups makes things
> much worse. We can end up walking 96 cpus * 500 cgroups * 2 (main) LRUs
> worth of per-cpu counters, which is a lot of data to be running through all
> the time.
Why does the number of cgroups matter?
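
As a rough illustration of the read-side cost quoted above, here is a
simplified userspace sketch (made-up names such as model_lruvec and
MODEL_NR_CPUS; this is not the kernel code): every size lookup walks one
counter per possible CPU, and Tim's numbers multiply that by the number of
cgroups times the LRUs per lruvec that reclaim visits.

/*
 * Simplified model of the loop quoted above: each lruvec_lru_size()-style
 * read sums a per-cpu counter, so a single lookup touches MODEL_NR_CPUS
 * slots.  With 96 CPUs, 500 cgroups and 2 main LRUs per lruvec, one pass
 * over all lruvecs walks on the order of 96 * 500 * 2 = 96000 counters.
 */
#include <stdio.h>

#define MODEL_NR_CPUS	96
#define MODEL_NR_LRUS	2	/* the two "main" LRUs from the example */

struct model_lruvec {
	/* per-cpu deltas, analogous to lruvec_stat_local->count[] */
	long percpu_count[MODEL_NR_CPUS][MODEL_NR_LRUS];
};

/* analogous to the lruvec_page_state_local() loop: O(nr_cpus) per call */
static long model_lru_size(struct model_lruvec *lv, int lru)
{
	long x = 0;
	int cpu;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		x += lv->percpu_count[cpu][lru];
	return x;
}

int main(void)
{
	static struct model_lruvec lv;	/* zero-initialized */

	lv.percpu_count[3][0] = 10;
	lv.percpu_count[42][0] = 5;
	lv.percpu_count[7][1] = 7;

	printf("lru0=%ld lru1=%ld (each read walked %d CPUs)\n",
	       model_lru_size(&lv, 0), model_lru_size(&lv, 1),
	       MODEL_NR_CPUS);
	return 0;
}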
> Honglei's patch
> (https://lore.kernel.org/linux-mm/991b4719-a2a0-9efe-de02-56a928752fe3@oracle.com/)
> restores the previous method of computing the LRU size and is much more
> efficient at getting it. We got a 20% throughput improvement in our database
> benchmark with Honglei's patch, and lruvec_lru_size's CPU overhead completely
> disappeared from the CPU profile.
>
> We'd like to see Honglei's patch merged.
The main problem with the patch was a lack of justification. If the
performance improvement is this large (I am quite surprised TBH) then I
would obviously not have any objections. Care to send a patch with the
complete changelog?
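
For readers who do not have the patch handy, the general shape of what it
restores can be sketched as below: keep the per-lruvec LRU sizes up to date
when pages are added or removed, so that reading the size is a plain load
instead of a for_each_possible_cpu() walk. All names here (sketch_lruvec,
sketch_update_lru_size, ...) are made up for illustration; this is not
Honglei's actual patch.

/*
 * Sketch of the "maintained counter" scheme: the size is adjusted on the
 * add/remove paths, which already touch the lruvec, and reads become O(1)
 * regardless of the number of CPUs or cgroups.
 */
#include <stdio.h>

enum sketch_lru { SKETCH_INACTIVE, SKETCH_ACTIVE, SKETCH_NR_LRUS };

struct sketch_lruvec {
	long lru_size[SKETCH_NR_LRUS];	/* kept up to date at add/remove */
};

/* called from the add/remove paths */
static void sketch_update_lru_size(struct sketch_lruvec *lv,
				   enum sketch_lru lru, long nr_pages)
{
	lv->lru_size[lru] += nr_pages;
}

/* O(1) read, no per-cpu summation */
static long sketch_lruvec_lru_size(struct sketch_lruvec *lv,
				   enum sketch_lru lru)
{
	return lv->lru_size[lru];
}

int main(void)
{
	struct sketch_lruvec lv = { { 0, 0 } };

	sketch_update_lru_size(&lv, SKETCH_INACTIVE, 128);
	sketch_update_lru_size(&lv, SKETCH_INACTIVE, -16);
	sketch_update_lru_size(&lv, SKETCH_ACTIVE, 64);

	printf("inactive=%ld active=%ld\n",
	       sketch_lruvec_lru_size(&lv, SKETCH_INACTIVE),
	       sketch_lruvec_lru_size(&lv, SKETCH_ACTIVE));
	return 0;
}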
> The problem can also be reproduced by running a simple multi-threaded
> pmbench benchmark with a fast Optane SSD as swap (see profile below).
>
>
> 6.15% 3.08% pmbench [kernel.vmlinux] [k] lruvec_lru_size
> |
> |--3.07%--lruvec_lru_size
> | |
> | |--2.11%--cpumask_next
> | | |
> | | --1.66%--find_next_bit
> | |
> | --0.57%--call_function_interrupt
> | |
> | --0.55%--smp_call_function_interrupt
> |
> |--1.59%--0x441f0fc3d009
> | _ops_rdtsc_init_base_freq
> | access_histogram
> | page_fault
> | __do_page_fault
> | handle_mm_fault
> | __handle_mm_fault
> | |
> | --1.54%--do_swap_page
> | swapin_readahead
> | swap_cluster_readahead
> | |
> | --1.53%--read_swap_cache_async
> | __read_swap_cache_async
> | alloc_pages_vma
> | __alloc_pages_nodemask
> | __alloc_pages_slowpath
> | try_to_free_pages
> | do_try_to_free_pages
> | shrink_node
> | shrink_node_memcg
> | |
> | |--0.77%--lruvec_lru_size
> | |
> | --0.76%--inactive_list_is_low
> | |
> | --0.76%--lruvec_lru_size
> |
> --1.50%--measure_read
> page_fault
> __do_page_fault
> handle_mm_fault
> __handle_mm_fault
> do_swap_page
> swapin_readahead
> swap_cluster_readahead
> |
> --1.48%--read_swap_cache_async
> __read_swap_cache_async
> alloc_pages_vma
> __alloc_pages_nodemask
> __alloc_pages_slowpath
> try_to_free_pages
> do_try_to_free_pages
> shrink_node
> shrink_node_memcg
> |
> |--0.75%--inactive_list_is_low
> | |
> | --0.75%--lruvec_lru_size
> |
> --0.73%--lruvec_lru_size
>
>
> Thanks.
>
> Tim
--
Michal Hocko
SUSE Labs
Thread overview: 12+ messages
2019-10-14 17:17 Tim Chen
2019-10-14 17:37 ` Michal Hocko [this message]
2019-10-14 17:49 ` Dave Hansen
2019-10-14 17:59 ` Michal Hocko
2019-10-14 18:06 ` Tim Chen
2019-10-14 18:31 ` Michal Hocko
2019-10-14 22:14 ` Andrew Morton
2019-10-15 6:19 ` Michal Hocko
2019-10-15 20:38 ` Andrew Morton
2019-10-16 7:25 ` Michal Hocko
2019-10-15 18:23 ` Tim Chen
2019-10-14 18:11 ` Dave Hansen