From: Michal Hocko
To: Dave Hansen
Cc: Tim Chen, Honglei Wang, Johannes Weiner, linux-mm@kvack.org, Andrew Morton
Subject: Re: memcgroup lruvec_lru_size scaling issue
Date: Mon, 14 Oct 2019 19:59:18 +0200
Message-ID: <20191014175918.GN317@dhcp22.suse.cz>
In-Reply-To: <0da86744-be11-06a8-2b38-5525dfe9d21e@intel.com>
References: <20191014173723.GM317@dhcp22.suse.cz> <0da86744-be11-06a8-2b38-5525dfe9d21e@intel.com>

On Mon 14-10-19 10:49:49, Dave Hansen wrote:
> On 10/14/19 10:37 AM, Michal Hocko wrote:
> >> for_each_possible_cpu(cpu)
> >>         x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
> >>
> >> It is costly looping through all the cpus to get the lruvec size info.
> >> And doing this on our workload with 96 cpu threads and 500 mem cgroups
> >> makes things much worse. We might end up having 96 cpus * 500 cgroups *
> >> 2 (main) LRU pagevecs, which is a lot of data structures to be running
> >> through all the time.
> > Why does the number of cgroups matter?
>
> I was thinking purely of the cache footprint.
> If it's reading pn->lruvec_stat_local->count[idx], that's three
> separate cachelines, so 192 bytes of cache * 96 CPUs = 18k of data,
> mostly read-only. 1 cgroup would be 18k of data for the whole system
> and the caching would be pretty efficient and all 18k would probably
> survive a tight page fault loop in the L1. 500 cgroups would be ~90k
> of data per CPU thread, which doesn't fit in the L1 and probably
> wouldn't survive a tight page fault loop if both logical threads were
> banging on different cgroups.
>
> It's just a theory, but it's why I noted the number of cgroups when I
> initially saw this show up in profiles.

Yes, the cache traffic might be really high, but I still find it a bit
surprising that it makes such a large footprint, because this should
mostly be called from slow paths (reclaim) and the real work done there
should simply be larger - at least that's my intuition, which might be
quite off here.

How much of the total time does that 25% of system time amount to, btw?
-- 
Michal Hocko
SUSE Labs
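
For reference, a minimal userspace sketch of the access pattern under
discussion - not the kernel code itself. The names model_pn,
model_lruvec_stat, sum_stat_local, NR_CPUS_MODEL, NR_CGROUPS_MODEL and
NR_STAT_ITEMS are invented stand-ins; the real loop is the
per_cpu()/for_each_possible_cpu() walk in mm/memcontrol.c quoted above,
and the footprint numbers in the comments are Dave's estimates from this
thread.

#include <stdio.h>

#define NR_CPUS_MODEL    96
#define NR_CGROUPS_MODEL 500
#define NR_STAT_ITEMS    40   /* rough stand-in for the kernel's stat item count */

/* One block of counters per CPU; loosely models lruvec_stat_local. */
struct model_lruvec_stat {
	long count[NR_STAT_ITEMS];
};

/* Loosely models the per-node memcg state ("pn") holding per-CPU counters. */
struct model_pn {
	struct model_lruvec_stat stat_local[NR_CPUS_MODEL];
};

/*
 * The pattern Tim is measuring: every read of a single statistic walks
 * all possible CPUs and touches a distinct per-CPU cacheline for each.
 * Per Dave's estimate above, that is ~192 bytes of cache per CPU per
 * cgroup, ~18k per cgroup across 96 CPUs, and ~90k per CPU thread once
 * 500 cgroups are in play.
 */
static long sum_stat_local(struct model_pn *pn, int idx)
{
	long x = 0;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		x += pn->stat_local[cpu].count[idx];
	return x;
}

int main(void)
{
	/* ~15MB of counters: 500 cgroups x 96 CPUs x 40 items x 8 bytes. */
	static struct model_pn cgroups[NR_CGROUPS_MODEL];
	long total = 0;

	/* A reclaim-style pass reads one statistic for every cgroup. */
	for (int cg = 0; cg < NR_CGROUPS_MODEL; cg++)
		total += sum_stat_local(&cgroups[cg], 0);

	printf("total = %ld (working set scales with cpus * cgroups)\n", total);
	return 0;
}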