Date: Mon, 14 Oct 2019 19:37:23 +0200
From: Michal Hocko
To: Tim Chen
Cc: Honglei Wang, Johannes Weiner, linux-mm@kvack.org, Andrew Morton, Dave Hansen
Subject: Re: memcgroup lruvec_lru_size scaling issue
Message-ID: <20191014173723.GM317@dhcp22.suse.cz>

On Mon 14-10-19 10:17:45, Tim Chen wrote:
> We were running a database benchmark in a mem cgroup and found that
> lruvec_lru_size takes up a huge chunk of CPU cycles (about 25% of our
> kernel time) on the 5.3 kernel.
> 
> The main issue is the loop in lruvec_page_state_local, called by
> lruvec_lru_size in the mem cgroup path:
> 
> 	for_each_possible_cpu(cpu)
> 		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> 
> Looping through all the CPUs to get the LRU vector size is costly,
> and doing this on our workload with 96 CPU threads and 500 mem cgroups
> makes things much worse. We might end up walking
> 96 cpus * 500 cgroups * 2 (main) LRU pagevecs, which is a lot of data
> structures to be running through all the time.

Why does the number of cgroups matter?

> Hongwei's patch
> (https://lore.kernel.org/linux-mm/991b4719-a2a0-9efe-de02-56a928752fe3@oracle.com/)
> restores the previous method of computing lru_size and is much more
> efficient at getting it. We got a 20% throughput improvement in our
> database benchmark with Hongwei's patch, and lruvec_lru_size's CPU
> overhead completely disappeared from the CPU profile.
> 
> We'd like to see Hongwei's patch merged.

The main problem with the patch was a lack of justification.
If the performance improvement is this large (I am quite surprised, TBH)
then I would obviously not have any objections. Care to send a patch with
a complete changelog?

> The problem can also be reproduced by running a simple multi-threaded
> pmbench benchmark with a fast Optane SSD as swap (see profile below):
> 
>     6.15%  3.08%  pmbench  [kernel.vmlinux]  [k] lruvec_lru_size
>      |
>      |--3.07%--lruvec_lru_size
>      |     |
>      |     |--2.11%--cpumask_next
>      |     |     |
>      |     |      --1.66%--find_next_bit
>      |     |
>      |      --0.57%--call_function_interrupt
>      |           |
>      |            --0.55%--smp_call_function_interrupt
>      |
>      |--1.59%--0x441f0fc3d009
>      |     _ops_rdtsc_init_base_freq
>      |     access_histogram
>      |     page_fault
>      |     __do_page_fault
>      |     handle_mm_fault
>      |     __handle_mm_fault
>      |     |
>      |      --1.54%--do_swap_page
>      |           swapin_readahead
>      |           swap_cluster_readahead
>      |           |
>      |            --1.53%--read_swap_cache_async
>      |                 __read_swap_cache_async
>      |                 alloc_pages_vma
>      |                 __alloc_pages_nodemask
>      |                 __alloc_pages_slowpath
>      |                 try_to_free_pages
>      |                 do_try_to_free_pages
>      |                 shrink_node
>      |                 shrink_node_memcg
>      |                 |
>      |                 |--0.77%--lruvec_lru_size
>      |                 |
>      |                  --0.76%--inactive_list_is_low
>      |                       |
>      |                        --0.76%--lruvec_lru_size
>      |
>       --1.50%--measure_read
>            page_fault
>            __do_page_fault
>            handle_mm_fault
>            __handle_mm_fault
>            do_swap_page
>            swapin_readahead
>            swap_cluster_readahead
>            |
>             --1.48%--read_swap_cache_async
>                  __read_swap_cache_async
>                  alloc_pages_vma
>                  __alloc_pages_nodemask
>                  __alloc_pages_slowpath
>                  try_to_free_pages
>                  do_try_to_free_pages
>                  shrink_node
>                  shrink_node_memcg
>                  |
>                  |--0.75%--inactive_list_is_low
>                  |     |
>                  |      --0.75%--lruvec_lru_size
>                  |
>                   --0.73%--lruvec_lru_size
> 
> Thanks.
> 
> Tim

-- 
Michal Hocko
SUSE Labs