linux-mm.kvack.org archive mirror
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Honglei Wang <honglei.wang@oracle.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@intel.com>
Subject: memcgroup lruvec_lru_size scaling issue
Date: Mon, 14 Oct 2019 10:17:45 -0700	[thread overview]
Message-ID: <a64eecf1-81d4-371f-ff6d-1cb057bd091c@linux.intel.com> (raw)

We were running a database benchmark in a mem cgroup and found that lruvec_lru_size
takes a huge chunk of CPU cycles (about 25% of our kernel time) on a 5.3 kernel.

The main issue is the loop in lruvec_page_state_local(), called by lruvec_lru_size()
in the mem cgroup path:

for_each_possible_cpu(cpu)
	x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

Looping through all the CPUs just to get the LRU size is costly, and on our
workload with 96 CPU threads and 500 mem cgroups it gets much worse: we can end
up walking 96 CPUs * 500 cgroups * 2 (main) LRUs' worth of per-cpu data, which
is a lot of structures to be running through all the time.

Honglei's patch
(https://lore.kernel.org/linux-mm/991b4719-a2a0-9efe-de02-56a928752fe3@oracle.com/)
restores the previous method of computing lru_size and is much more efficient.
With Honglei's patch we got a 20% throughput improvement in our database benchmark, and
lruvec_lru_size's CPU overhead completely disappeared from the CPU profile.

We'd like to see Honglei's patch merged.

The problem can also be reproduced by running a simple multi-threaded pmbench benchmark
with a fast Optane SSD as swap (see the profile below).


6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
            |
            |--3.07%--lruvec_lru_size
            |          |
            |          |--2.11%--cpumask_next
            |          |          |
            |          |           --1.66%--find_next_bit
            |          |
            |           --0.57%--call_function_interrupt
            |                     |
            |                      --0.55%--smp_call_function_interrupt
            |
            |--1.59%--0x441f0fc3d009
            |          _ops_rdtsc_init_base_freq
            |          access_histogram
            |          page_fault
            |          __do_page_fault
            |          handle_mm_fault
            |          __handle_mm_fault
            |          |
            |           --1.54%--do_swap_page
            |                     swapin_readahead
            |                     swap_cluster_readahead
            |                     |
            |                      --1.53%--read_swap_cache_async
            |                                __read_swap_cache_async
            |                                alloc_pages_vma
            |                                __alloc_pages_nodemask
            |                                __alloc_pages_slowpath
            |                                try_to_free_pages
            |                                do_try_to_free_pages
            |                                shrink_node
            |                                shrink_node_memcg
            |                                |
            |                                |--0.77%--lruvec_lru_size
            |                                |
            |                                 --0.76%--inactive_list_is_low
            |                                           |
            |                                            --0.76%--lruvec_lru_size
            |
             --1.50%--measure_read
                       page_fault
                       __do_page_fault
                       handle_mm_fault
                       __handle_mm_fault
                       do_swap_page
                       swapin_readahead
                       swap_cluster_readahead
                       |
                        --1.48%--read_swap_cache_async
                                  __read_swap_cache_async
                                  alloc_pages_vma
                                  __alloc_pages_nodemask
                                  __alloc_pages_slowpath
                                  try_to_free_pages
                                  do_try_to_free_pages
                                  shrink_node
                                  shrink_node_memcg
                                  |
                                  |--0.75%--inactive_list_is_low
                                  |          |
                                  |           --0.75%--lruvec_lru_size
                                  |
                                   --0.73%--lruvec_lru_size


Thanks.

Tim


