From: Shakeel Butt
To: Jesper Dangaard Brouer
Cc: Yosry Ahmed, Nhat Pham, Andrew Morton, Johannes Weiner, Michal Hocko,
    Roman Gushchin, Muchun Song, Yu Zhao, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Meta kernel team, cgroups@vger.kernel.org
Subject: Re: [PATCH v2] memcg: use ratelimited stats flush in the reclaim
Date: Wed, 14 Aug 2024 09:32:36 -0700
References: <20240813215358.2259750-1-shakeel.butt@linux.dev>

Ccing Nhat

On Wed, Aug 14, 2024 at 02:57:38PM GMT, Jesper Dangaard Brouer wrote:
>
>
> On 14/08/2024 00.30, Shakeel Butt wrote:
> > On Tue, Aug 13, 2024 at 02:58:51PM GMT, Yosry Ahmed wrote:
> > > On Tue, Aug 13, 2024 at 2:54 PM Shakeel Butt wrote:
> > > >
> > > > The Meta prod is seeing a large number of stalls in memcg stats flush
> > > > from the memcg reclaim code path. At the moment, this specific callsite
> > > > is doing a synchronous memcg stats flush. The rstat flush is an
> > > > expensive and time-consuming operation, so concurrent reclaimers will
> > > > busywait on the lock potentially for a long time. Actually this issue is
> > > > not unique to Meta and has been observed by Cloudflare [1] as well. For
> > > > the Cloudflare case, the stalls were due to contention between kswapd
> > > > threads running on their 8 NUMA node machines, which does not make sense
> > > > as rstat flush is global and a flush from one kswapd thread should be
> > > > sufficient for all. Simply replace the synchronous flush with the
> > > > ratelimited one.
> > > >
> > > > One may raise a concern about potentially using 2 sec stale (at worst)
> > > > stats for heuristics like desirable inactive:active ratio and preferring
> > > > inactive file pages over anon pages, but these specific heuristics do not
> > > > require very precise stats and also are ignored under severe memory
> > > > pressure.
> > > >
> > > > More specifically for this code path, the stats are needed for two
> > > > specific heuristics:
> > > >
> > > > 1. Deactivate LRUs
> > > > 2. Cache trim mode
> > > >
> > > > The deactivate LRUs heuristic is to maintain a desirable inactive:active
> > > > ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
> > > > and the hierarchical LRU size.
> > > > WORKINGSET_ACTIVATE* is needed to check whether there has been a
> > > > refault since the last snapshot, and the LRU sizes are needed for the
> > > > desirable ratio between inactive and active LRUs. See the table below
> > > > on how the desirable ratio is calculated.
> > > >
> > > > /* total     target    max
> > > >  * memory    ratio     inactive
> > > >  * -------------------------------------
> > > >  *   10MB       1         5MB
> > > >  *  100MB       1        50MB
> > > >  *    1GB       3       250MB
> > > >  *   10GB      10       0.9GB
> > > >  *  100GB      31         3GB
> > > >  *    1TB     101        10GB
> > > >  *   10TB     320        32GB
> > > >  */
> > > >
> > > > The desirable ratio only changes at the boundary of 1 GiB, 10 GiB,
> > > > 100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate
> > > > LRU size information to calculate this ratio. In addition, if
> > > > deactivation is skipped for some LRU, the kernel will force deactivation
> > > > under severe memory pressure.
> > > >
> > > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > > only checks if it is zero or not. Again, precise information is not
> > > > needed.
> > > >
> > > > This patch has been running on the Meta fleet for several months and we
> > > > have not observed any issues. Please note that MGLRU is not impacted by
> > > > this issue at all as it avoids rstat flushing completely.
> > > >
> > > > Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
> > > > Signed-off-by: Shakeel Butt
> > >
> > > Just curious, does Jesper's patch help with this problem?
> >
> > If you are asking if I have tested Jesper's patch in Meta's production
> > then no, I have not tested it. Also I have not taken a look at the
> > latest from Jesper as I was stuck in some other issues.
> >
>
> I see this patch as a whac-a-mole approach. But it should be applied as
> a stopgap, because my patches are still not ready to be merged.
>
> My patch is more generic, but *only* solves the rstat lock contention
> part of the issue. The remaining issue is that rstat is flushed too
> often, which I address in my other patch[2] "cgroup/rstat: introduce
> ratelimited rstat flushing". In [2], I explicitly excluded memcg as
> Shakeel's patch demonstrates memcg already has a ratelimit API specific
> to memcg.
>
> [2] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
>
> I suspect the next whac-a-mole will be the rstat flush for the slab code
> that kswapd also activates via shrink_slab, which via
> shrinker->count_objects() invokes count_shadow_nodes().
>

Actually count_shadow_nodes() is already using the ratelimited version.
However zswap_shrinker_count() is still using the sync version. Nhat is
modifying this code at the moment and we can ask if we really need the
most accurate values for MEMCG_ZSWAP_B and MEMCG_ZSWAPPED for the zswap
writeback heuristic.

> --Jesper