Date: Wed, 16 Oct 2019 09:25:20 +0200
From: Michal Hocko
To: Tim Chen, Andrew Morton
Cc: Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm@kvack.org
Subject: Re: memcgroup lruvec_lru_size scaling issue
Message-ID: <20191016072520.GK317@dhcp22.suse.cz>
References: <20191014173723.GM317@dhcp22.suse.cz>
 <0da86744-be11-06a8-2b38-5525dfe9d21e@intel.com>
 <20191014175918.GN317@dhcp22.suse.cz>
 <40748407-eafc-e08b-5777-1cbf892fcc52@linux.intel.com>
 <20191014183107.GO317@dhcp22.suse.cz>
 <20191014151430.fc419425e515188b904cd8af@linux-foundation.org>
 <20191015061920.GQ317@dhcp22.suse.cz>
 <20191015133831.945341efe6c100d922291653@linux-foundation.org>
In-Reply-To: <20191015133831.945341efe6c100d922291653@linux-foundation.org>

On Tue 15-10-19 13:38:31, Andrew Morton wrote:
> On Tue, 15 Oct 2019 08:19:20 +0200 Michal Hocko wrote:
> 
> > I dunno, but squashing those two changelogs sounds more confusing than
> > helpful to me. What about the following instead?
> 
> From: Honglei Wang
> Subject: mm: memcg: get number of pages on the LRU list in memcgroup based on lru_zone_size
> 
> 1a61ab8038e72 ("mm: memcontrol: replace zone summing with
> lruvec_page_state()") made lruvec_page_state use per-cpu counters
> instead of calculating the size directly from lru_zone_size, with the
> idea that this would be more efficient. Tim has reported that this is
> not really the case for their database benchmark, which shows the
> opposite result: lruvec_page_state takes up a huge chunk of CPU cycles
> (about 25% of the system time, which is roughly 7% of total CPU cycles)
> on 5.3 kernels. The workload runs on a large machine (96 CPUs), has
> many cgroups (500) and is heavily direct-reclaim bound.
> 
> Tim Chen said:
> 
> : The problem can also be reproduced by running simple multi-threaded
> : pmbench benchmark with a fast Optane SSD swap (see profile below).
> :
> :
> : 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
> :            |
> :            |--3.07%--lruvec_lru_size
> :            |          |
> :            |          |--2.11%--cpumask_next
> :            |          |          |
> :            |          |           --1.66%--find_next_bit
> :            |          |
> :            |           --0.57%--call_function_interrupt
> :            |                     |
> :            |                      --0.55%--smp_call_function_interrupt
> :            |
> :            |--1.59%--0x441f0fc3d009
> :            |          _ops_rdtsc_init_base_freq
> :            |          access_histogram
> :            |          page_fault
> :            |          __do_page_fault
> :            |          handle_mm_fault
> :            |          __handle_mm_fault
> :            |          |
> :            |           --1.54%--do_swap_page
> :            |                     swapin_readahead
> :            |                     swap_cluster_readahead
> :            |                     |
> :            |                      --1.53%--read_swap_cache_async
> :            |                                __read_swap_cache_async
> :            |                                alloc_pages_vma
> :            |                                __alloc_pages_nodemask
> :            |                                __alloc_pages_slowpath
> :            |                                try_to_free_pages
> :            |                                do_try_to_free_pages
> :            |                                shrink_node
> :            |                                shrink_node_memcg
> :            |                                |
> :            |                                |--0.77%--lruvec_lru_size
> :            |                                |
> :            |                                 --0.76%--inactive_list_is_low
> :            |                                           |
> :            |                                            --0.76%--lruvec_lru_size
> :            |
> :             --1.50%--measure_read
> :                       page_fault
> :                       __do_page_fault
> :                       handle_mm_fault
> :                       __handle_mm_fault
> :                       do_swap_page
> :                       swapin_readahead
> :                       swap_cluster_readahead
> :                       |
> :                        --1.48%--read_swap_cache_async
> :                                  __read_swap_cache_async
> :                                  alloc_pages_vma
> :                                  __alloc_pages_nodemask
> :                                  __alloc_pages_slowpath
> :                                  try_to_free_pages
> :                                  do_try_to_free_pages
> :                                  shrink_node
> :                                  shrink_node_memcg
> :                                  |
> :                                  |--0.75%--inactive_list_is_low
> :                                  |          |
> :                                  |           --0.75%--lruvec_lru_size
> :                                  |
> :                                   --0.73%--lruvec_lru_size
> 
> The likely culprit is the cache traffic that lruvec_page_state_local
> generates. Dave Hansen says:
> 
> : I was thinking purely of the cache footprint. If it's reading
> : pn->lruvec_stat_local->count[idx], that is three separate cachelines,
> : so 192 bytes of cache * 96 CPUs = 18k of data, mostly read-only. 1
> : cgroup would be 18k of data for the whole system and the caching would
> : be pretty efficient and all 18k would probably survive a tight page
> : fault loop in the L1. 500 cgroups would be ~90k of data per CPU thread
> : which doesn't fit in the L1 and probably wouldn't survive a tight page
> : fault loop if both logical threads were banging on different cgroups.
> :
> : It's just a theory, but it's why I noted the number of cgroups when I
> : initially saw this show up in profiles.

Btw. that theory could be confirmed by an increased number of cache
misses, IIUC. Tim, could you give it a try please?

> 
> Fix the regression by partially reverting the said commit and
> calculating the lru size explicitly.
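
To make the cost model concrete: the memcg branch of lruvec_lru_size()
currently goes through lruvec_page_state_local(), which walks every
possible CPU and sums a per-cpu counter on each call. The sketch below
is a simplified, from-memory rendition of the 5.3-era code, not a
verbatim copy; the for_each_possible_cpu() walk is where the
cpumask_next/find_next_bit samples in the profile come from, and the
per-cpu loads are the cache traffic Dave describes above.

	/* Simplified sketch of the per-cpu summing path (abridged): one
	 * full walk over all possible CPUs per call, touching
	 * pn->lruvec_stat_local->count[idx] of every CPU. */
	static unsigned long lruvec_page_state_local(struct lruvec *lruvec,
						     enum node_stat_item idx)
	{
		struct mem_cgroup_per_node *pn;
		long x = 0;
		int cpu;

		pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
		for_each_possible_cpu(cpu)
			x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

		return x < 0 ? 0 : x;
	}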
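
For comparison, the mem_cgroup_get_zone_lru_size() helper that the
patch below switches back to boils down to a plain read of a counter
maintained at LRU add/remove time, so the per-call cost no longer
scales with the number of CPUs (again an abridged sketch from memory of
include/linux/memcontrol.h, not the exact code):

	/* Sketch: per-zone LRU size kept in mem_cgroup_per_node and
	 * updated when pages are added to / removed from the LRU, so
	 * reading it is a single load rather than a per-cpu sum. */
	static unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
							  enum lru_list lru,
							  int zone_idx)
	{
		struct mem_cgroup_per_node *mz;

		mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
		return mz->lru_zone_size[zone_idx][lru];
	}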
> 
> Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
> Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
> Signed-off-by: Honglei Wang
> Reported-by: Tim Chen
> Acked-by: Tim Chen
> Tested-by: Tim Chen
> Cc: Vladimir Davydov
> Cc: Johannes Weiner
> Cc: Roman Gushchin
> Cc: Tejun Heo
> Cc: Michal Hocko
> Cc: Dave Hansen
> Cc: [5.2+]
> Signed-off-by: Andrew Morton

Acked-by: Michal Hocko

> ---
> 
>  mm/vmscan.c |    9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> --- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
> +++ a/mm/vmscan.c
> @@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size;
> +	unsigned long lru_size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled())
> -		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
> -	else
> +	if (!mem_cgroup_disabled()) {
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +	} else
>  		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  
>  	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> _
-- 
Michal Hocko
SUSE Labs