Date: Wed, 16 Oct 2019 09:25:20 +0200
From: Michal Hocko
To: Tim Chen, Andrew Morton
Cc: Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm@kvack.org
Subject: Re: memcgroup lruvec_lru_size scaling issue
Message-ID: <20191016072520.GK317@dhcp22.suse.cz>
References: <20191014173723.GM317@dhcp22.suse.cz>
 <0da86744-be11-06a8-2b38-5525dfe9d21e@intel.com>
 <20191014175918.GN317@dhcp22.suse.cz>
 <40748407-eafc-e08b-5777-1cbf892fcc52@linux.intel.com>
 <20191014183107.GO317@dhcp22.suse.cz>
 <20191014151430.fc419425e515188b904cd8af@linux-foundation.org>
 <20191015061920.GQ317@dhcp22.suse.cz>
 <20191015133831.945341efe6c100d922291653@linux-foundation.org>
In-Reply-To: <20191015133831.945341efe6c100d922291653@linux-foundation.org>

On Tue 15-10-19 13:38:31, Andrew Morton wrote:
> On Tue, 15 Oct 2019 08:19:20 +0200 Michal Hocko wrote:
> 
> > I dunno, but squashing those two changelogs sounds more confusing than
> > helpful to me. What about the following instead?
> 
> From: Honglei Wang
> Subject: mm: memcg: get number of pages on the LRU list in memcgroup based on lru_zone_size
> 
> 1a61ab8038e72 ("mm: memcontrol: replace zone summing with
> lruvec_page_state()") made lruvec_page_state use per-cpu counters
> instead of calculating the size directly from lru_zone_size, with the
> idea that this would be more efficient. Tim has reported that this is
> not really the case for their database benchmark, which shows the
> opposite result: lruvec_page_state takes up a huge chunk of CPU cycles
> (about 25% of the system time, which is roughly 7% of total CPU cycles)
> on 5.3 kernels. The workload runs on a large machine (96 CPUs), has
> many cgroups (500) and is heavily direct-reclaim bound.
> 
> Tim Chen said:
> 
> : The problem can also be reproduced by running simple multi-threaded
> : pmbench benchmark with a fast Optane SSD swap (see profile below).
> :
> :
> : 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
> :            |
> :            |--3.07%--lruvec_lru_size
> :            |          |
> :            |          |--2.11%--cpumask_next
> :            |          |          |
> :            |          |           --1.66%--find_next_bit
> :            |          |
> :            |           --0.57%--call_function_interrupt
> :            |                     |
> :            |                      --0.55%--smp_call_function_interrupt
> :            |
> :            |--1.59%--0x441f0fc3d009
> :            |          _ops_rdtsc_init_base_freq
> :            |          access_histogram
> :            |          page_fault
> :            |          __do_page_fault
> :            |          handle_mm_fault
> :            |          __handle_mm_fault
> :            |          |
> :            |           --1.54%--do_swap_page
> :            |                     swapin_readahead
> :            |                     swap_cluster_readahead
> :            |                     |
> :            |                      --1.53%--read_swap_cache_async
> :            |                                __read_swap_cache_async
> :            |                                alloc_pages_vma
> :            |                                __alloc_pages_nodemask
> :            |                                __alloc_pages_slowpath
> :            |                                try_to_free_pages
> :            |                                do_try_to_free_pages
> :            |                                shrink_node
> :            |                                shrink_node_memcg
> :            |                                |
> :            |                                |--0.77%--lruvec_lru_size
> :            |                                |
> :            |                                 --0.76%--inactive_list_is_low
> :            |                                           |
> :            |                                            --0.76%--lruvec_lru_size
> :            |
> :             --1.50%--measure_read
> :                       page_fault
> :                       __do_page_fault
> :                       handle_mm_fault
> :                       __handle_mm_fault
> :                       do_swap_page
> :                       swapin_readahead
> :                       swap_cluster_readahead
> :                       |
> :                        --1.48%--read_swap_cache_async
> :                                  __read_swap_cache_async
> :                                  alloc_pages_vma
> :                                  __alloc_pages_nodemask
> :                                  __alloc_pages_slowpath
> :                                  try_to_free_pages
> :                                  do_try_to_free_pages
> :                                  shrink_node
> :                                  shrink_node_memcg
> :                                  |
> :                                  |--0.75%--inactive_list_is_low
> :                                  |          |
> :                                  |           --0.75%--lruvec_lru_size
> :                                  |
> :                                   --0.73%--lruvec_lru_size
> 
> The likely culprit is the cache traffic that lruvec_page_state_local
> generates. Dave Hansen says:
> 
> : I was thinking purely of the cache footprint. If it's reading
> : pn->lruvec_stat_local->count[idx], that is three separate cachelines,
> : so 192 bytes of cache * 96 CPUs = 18k of data, mostly read-only. 1
> : cgroup would be 18k of data for the whole system and the caching would
> : be pretty efficient and all 18k would probably survive a tight page
> : fault loop in the L1. 500 cgroups would be ~90k of data per CPU thread
> : which doesn't fit in the L1 and probably wouldn't survive a tight page
> : fault loop if both logical threads were banging on different cgroups.
> :
> : It's just a theory, but it's why I noted the number of cgroups when I
> : initially saw this show up in profiles.

Btw. that theory could be confirmed by an increased number of cache
misses, IIUC. Tim, could you give it a try please?

> 
> Fix the regression by partially reverting the said commit and
> calculating the lru size explicitly.
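
To make the cost model concrete: the memcg branch of lruvec_lru_size()
currently goes through lruvec_page_state_local(), which walks every
possible CPU and sums a per-cpu counter on each call. The sketch below
is a simplified, from-memory rendition of the 5.3-era code, not a
verbatim copy; the for_each_possible_cpu() walk is where the
cpumask_next/find_next_bit samples in the profile come from, and the
per-cpu loads are the cache traffic Dave describes above.

	/* Simplified sketch of the per-cpu summing path (abridged): one
	 * full walk over all possible CPUs per call, touching
	 * pn->lruvec_stat_local->count[idx] of every CPU. */
	static unsigned long lruvec_page_state_local(struct lruvec *lruvec,
						     enum node_stat_item idx)
	{
		struct mem_cgroup_per_node *pn;
		long x = 0;
		int cpu;

		pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
		for_each_possible_cpu(cpu)
			x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);

		return x < 0 ? 0 : x;
	}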
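
For comparison, the mem_cgroup_get_zone_lru_size() helper that the
patch below switches back to boils down to a plain read of a counter
maintained at LRU add/remove time, so the per-call cost no longer
scales with the number of CPUs (again an abridged sketch from memory of
include/linux/memcontrol.h, not the exact code):

	/* Sketch: per-zone LRU size kept in mem_cgroup_per_node and
	 * updated when pages are added to / removed from the LRU, so
	 * reading it is a single load rather than a per-cpu sum. */
	static unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
							  enum lru_list lru,
							  int zone_idx)
	{
		struct mem_cgroup_per_node *mz;

		mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
		return mz->lru_zone_size[zone_idx][lru];
	}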
> 
> Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
> Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
> Signed-off-by: Honglei Wang
> Reported-by: Tim Chen
> Acked-by: Tim Chen
> Tested-by: Tim Chen
> Cc: Vladimir Davydov
> Cc: Johannes Weiner
> Cc: Roman Gushchin
> Cc: Tejun Heo
> Cc: Michal Hocko
> Cc: Dave Hansen
> Cc: [5.2+]
> Signed-off-by: Andrew Morton

Acked-by: Michal Hocko

> ---
> 
>  mm/vmscan.c |    9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> --- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
> +++ a/mm/vmscan.c
> @@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
>   */
>  unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
>  {
> -	unsigned long lru_size;
> +	unsigned long lru_size = 0;
>  	int zid;
>  
> -	if (!mem_cgroup_disabled())
> -		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
> -	else
> +	if (!mem_cgroup_disabled()) {
> +		for (zid = 0; zid < MAX_NR_ZONES; zid++)
> +			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
> +	} else
>  		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
>  
>  	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> _
-- 
Michal Hocko
SUSE Labs