Date: Mon, 14 Oct 2019 15:14:30 -0700
From: Andrew Morton
To: Michal Hocko
Cc: Tim Chen, Dave Hansen, Honglei Wang, Johannes Weiner, linux-mm@kvack.org
Subject: Re: memcgroup lruvec_lru_size scaling issue
Message-Id: <20191014151430.fc419425e515188b904cd8af@linux-foundation.org>
In-Reply-To: <20191014183107.GO317@dhcp22.suse.cz>
References: <20191014173723.GM317@dhcp22.suse.cz>
	<0da86744-be11-06a8-2b38-5525dfe9d21e@intel.com>
	<20191014175918.GN317@dhcp22.suse.cz>
	<40748407-eafc-e08b-5777-1cbf892fcc52@linux.intel.com>
	<20191014183107.GO317@dhcp22.suse.cz>
On Mon, 14 Oct 2019 20:31:07 +0200 Michal Hocko wrote:

> Please put all that to the changelog. It would be also great to see
> whether that really scales with the number of cgroups if that is easy
> to check with your benchmark.

I restored this, with some changelog additions.  Tim, please send along
any updates you'd like to make to this, along with tested-by, acked-by,
etc.

I added the Fixes: tag.  Do we think this is serious enough to warrant
backporting into -stable trees?  I suspect so, as any older-tree
maintainer who spies this change will say "hell yeah".


From: Honglei Wang
Subject: mm: memcg: get number of pages on the LRU list in memcg based on lru_zone_size

lruvec_lru_size() invokes lruvec_page_state_local() to get the
lru_size, which is based on lruvec_stat_local.count[] of
mem_cgroup_per_node.  That counter is updated in a batched way: pages
are not accounted there until the number of incoming pages reaches
MEMCG_CHARGE_BATCH, which is defined as 32 (a minimal model of this
batching follows Tim's numbers below).

The LTP testcase madvise09 [1] fails because small blocks of memory are
not accounted.  It creates a new memcg and sets up 32 MADV_FREE pages,
then forks a child which introduces memory pressure in that memcg.  The
MADV_FREE pages are expected to be released under the pressure, but 32
is not more than MEMCG_CHARGE_BATCH, so these pages are not reflected
in lruvec_stat_local.count[] until more pages come in to fill the
batch.  Hence the MADV_FREE pages cannot be freed under memory
pressure, which conflicts with the definition of MADV_FREE.

Getting the lru_size from lru_zone_size of mem_cgroup_per_node instead,
which is not updated via batching, makes it accurate in this scenario.

This is effectively a partial reversion of 1a61ab8038e72 ("mm:
memcontrol: replace zone summing with lruvec_page_state()").

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise09.c

Tim said:

: We were running a database benchmark in a mem cgroup and found that
: lruvec_lru_size was taking up a huge chunk of CPU cycles (about 25% of
: our kernel time, about 7% of total cpu cycles) on a 5.3 kernel.
:
: The main issue is the loop in lruvec_page_state_local, called by
: lruvec_lru_size in the mem cgroup path:
:
:	for_each_possible_cpu(cpu)
:		x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
:
: It is costly looping through all the cpus to get the lru vec size
: info.  And doing this on our workload with 96 cpu threads and 500 mem
: cgroups makes things much worse.  We might end up having 96 cpus * 500
: cgroups * 2 (main) LRUs pagevecs, which is a lot of data structures to
: be running through all the time.
:
: Honglei's patch restores the previous method for computing lru_size
: and is much more efficient in getting the lru_size.  We got a 20%
: throughput improvement in our database benchmark with Honglei's patch,
: and lruvec_lru_size's cpu overhead completely disappeared from the cpu
: profile.
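
To make the failure mode concrete, here's a minimal user-space model of
the batching behaviour the changelog describes.  Every name in it
(BATCH, pending, flushed, mod_counter) is invented for the sketch; the
real code lives in mm/memcontrol.c and is more involved:

	#include <stdio.h>
	#include <stdlib.h>

	#define NR_CPUS	4
	#define BATCH	32		/* stands in for MEMCG_CHARGE_BATCH */

	static long flushed;		/* the counter readers actually see */
	static long pending[NR_CPUS];	/* per-CPU deltas not yet flushed */

	/* Writer side: account pages on one CPU, flushing full batches only. */
	static void mod_counter(int cpu, long nr_pages)
	{
		pending[cpu] += nr_pages;
		if (labs(pending[cpu]) > BATCH) {
			flushed += pending[cpu];
			pending[cpu] = 0;
		}
	}

	int main(void)
	{
		mod_counter(0, 32);	/* the 32 MADV_FREE pages in madvise09 */
		printf("visible lru size: %ld\n", flushed);	/* prints 0 */
		return 0;
	}

The printed 0 is the madvise09 failure in miniature: the 32 pages sit
in a per-CPU delta that never crosses the batch threshold, so reclaim
believes the LRU is empty.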
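
Tim's profile numbers come from the read side, which can be sketched
the same way: summing a per-CPU counter touches one slot per possible
CPU on every call, whereas the per-zone counters the patch switches to
are a fixed handful of plain loads.  Names again invented for
illustration, not kernel code:

	#include <stdio.h>

	#define NR_CPUS		96	/* CPU threads on Tim's machine */
	#define MAX_NR_ZONES	4	/* small constant, as in the kernel */

	static long percpu_count[NR_CPUS];	/* old path: one slot per CPU */
	static long zone_size[MAX_NR_ZONES];	/* new path: per-zone totals */

	/* Old path: every size query walks all possible CPUs. */
	static long lru_size_percpu(void)
	{
		long x = 0;
		int cpu;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			x += percpu_count[cpu];
		return x;
	}

	/* New path: MAX_NR_ZONES direct reads. */
	static long lru_size_zones(void)
	{
		long x = 0;
		int zid;

		for (zid = 0; zid < MAX_NR_ZONES; zid++)
			x += zone_size[zid];
		return x;
	}

	int main(void)
	{
		/* 96 iterations versus 4, per query, per LRU, per cgroup. */
		printf("%ld %ld\n", lru_size_percpu(), lru_size_zones());
		return 0;
	}

Multiply the 96-iteration loop by 500 cgroups and the several LRU lists
reclaim polls, and the ~25% kernel-time figure above is unsurprising.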
Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
Signed-off-by: Honglei Wang
Reported-by: Tim Chen
Cc: Vladimir Davydov
Cc: Johannes Weiner
Cc: Roman Gushchin
Cc: Tejun Heo
Cc: Michal Hocko
Cc: Dave Hansen
Signed-off-by: Andrew Morton
---

 mm/vmscan.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-get-number-of-pages-on-the-lru-list-in-memcgroup-base-on-lru_zone_size
+++ a/mm/vmscan.c
@@ -351,12 +351,13 @@ unsigned long zone_reclaimable_pages(str
  */
 unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, int zone_idx)
 {
-	unsigned long lru_size;
+	unsigned long lru_size = 0;
 	int zid;
 
-	if (!mem_cgroup_disabled())
-		lru_size = lruvec_page_state_local(lruvec, NR_LRU_BASE + lru);
-	else
+	if (!mem_cgroup_disabled()) {
+		for (zid = 0; zid < MAX_NR_ZONES; zid++)
+			lru_size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
+	} else
 		lru_size = node_page_state(lruvec_pgdat(lruvec), NR_LRU_BASE + lru);
 
 	for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
_
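
For reference, the helper the patch calls, mem_cgroup_get_zone_lru_size(),
is a static inline in include/linux/memcontrol.h.  Quoting it from
memory of a 5.3-era tree (so treat the exact body as a sketch), it is a
direct read with no per-CPU walk:

	static inline
	unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
			enum lru_list lru, int zone_idx)
	{
		struct mem_cgroup_per_node *mz;

		mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
		return mz->lru_zone_size[zone_idx][lru];
	}

As far as I recall, lru_zone_size[] is maintained synchronously by
mem_cgroup_update_lru_size() as pages move on and off the LRU, which is
why it stays accurate in the madvise09 case where the batched counters
lag behind.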