Date: Mon, 17 Jun 2024 10:20:21 -0700
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Yu Zhao <yuzhao@google.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Facebook Kernel Team <kernel-team@meta.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim
References: <20240615081257.3945587-1-shakeel.butt@linux.dev>

On Sat, Jun 15, 2024 at 05:28:55PM GMT, Yosry Ahmed wrote:
> On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > Meta production is seeing a large number of stalls in the memcg stats
> > flush from the memcg reclaim code path. At the moment, this specific
> > callsite does a synchronous memcg stats flush. The rstat flush is an
> > expensive and time-consuming operation, so concurrent reclaimers will
> > busy-wait on the lock, potentially for a long time. This issue is not
> > unique to Meta and has been observed by Cloudflare [1] as well. In the
> > Cloudflare case, the stalls were due to contention between kswapd
> > threads running on their 8-node NUMA machines, which does not make
> > sense, as the rstat flush is global and a flush from one kswapd
> > thread should be sufficient for all. Simply replace the synchronous
> > flush with the ratelimited one.
> >
> > One may raise a concern about potentially using stats that are (at
> > worst) 2 seconds stale for heuristics like the desirable
> > inactive:active ratio and preferring inactive file pages over anon
> > pages, but these specific heuristics do not require very precise
> > stats and are also ignored under severe memory pressure. This patch
> > has been running on the Meta fleet for more than a month and we have
> > not observed any issues. Please note that MGLRU is not impacted by
> > this issue at all, as it avoids rstat flushing completely.
> >
> > Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > ---
> >  mm/vmscan.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c0429fd6c573..bda4f92eba71 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2263,7 +2263,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
> >  	 * Flush the memory cgroup stats, so that we read accurate per-memcg
> >  	 * lruvec stats for heuristics.
> >  	 */
> > -	mem_cgroup_flush_stats(sc->target_mem_cgroup);
> > +	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
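
(Aside for readers following the thread: the ratelimited variant
substituted above is essentially a time gate in front of the regular
flush. A rough sketch, paraphrasing mm/memcontrol.c around the time of
this series; exact names and constants may differ:)

	/* FLUSH_TIME (2s) is the periodic flusher's interval. */
	void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
	{
		/* Skip entirely unless the periodic flusher has fallen behind. */
		if (time_after64(jiffies_64,
				 READ_ONCE(flush_last_time) + 2UL * FLUSH_TIME))
			mem_cgroup_flush_stats(memcg);
	}

So a caller here never waits on the rstat lock; at worst it acts on
stats that are a couple of seconds old, which is the trade-off debated
below.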
> 
> I think you already know my opinion about this one :) I don't like it
> at all,

Yup, I know.

> and I will explain why below. I know it may be a necessary
> evil, but I would like us to make sure there is no other option before
> going forward with this.

Rather than a necessary evil, I would call it a pragmatic approach,
i.e. resolve the ongoing pain with a good enough solution and work on
a long-term solution later.

> 
> Unfortunately, I am travelling this week, so I probably won't be able
> to follow up on this for a week or so, but I will try to lay down my
> thoughts as much as I can.
> 
> Why don't I like this?
> 
> - From a high level, I don't like the approach of replacing
> problematic flushing calls with the ratelimited version. It strikes me
> as a whac-a-mole approach that is mitigating symptoms of a larger
> problem.
> 
> - With the added thresholding code, a flush is only done if there is a
> significant number of pending updates in the relevant subtree.
> Choosing the ratelimited approach is intentionally ignoring a
> significant change in stats (although arguably it could be irrelevant
> stats).

Intentionally ignoring a significant change is fine, as you said, for
irrelevant stats, but also for cases where we don't need exact and
precise stats. See my next point.

> 
> - Reclaim code is an iterative process, so not updating the stats on
> every retry is very counterintuitive. We are retrying reclaim using
> the same stats and heuristics used by a previous iteration,
> essentially dismissing the effects of those previous iterations.
> 

I think I explained in the commit message why we don't need precise
stats for this specific case, but let me reiterate. The stats are
needed for two specific heuristics here:

1. Deactivate LRUs
2. Cache trim mode

The deactivate LRUs heuristic is to maintain a desirable
inactive:active ratio of the LRUs. The specific stats needed are
WORKINGSET_ACTIVATE* and the hierarchical LRU sizes. WORKINGSET_ACTIVATE*
is needed to check whether there has been a refault since the last
snapshot, and the LRU sizes are needed for the desirable ratio between
the inactive and active LRUs. See the table below on how the desirable
ratio is calculated:

/*
 * total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */

The desirable ratio only changes at the boundaries of 1 GiB, 10 GiB,
100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate
LRU size information to calculate this ratio. In addition, if
deactivation is skipped for some LRU, the kernel will force the
deactivation under severe memory pressure.

For the cache trim mode, the inactive file LRU size is read and the
kernel scales it down based on the reclaim iteration
(file >> sc->priority), and only checks whether it is zero or not.
Again, precise information is not needed.
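
Concretely, both heuristics boil down to a handful of lines in
mm/vmscan.c. A lightly paraphrased sketch (see inactive_is_low() and
prepare_scan_control() for the real code):

	/* inactive_is_low(): target ratio from the total LRU size in GiB. */
	gb = (inactive + active) >> (30 - PAGE_SHIFT);
	inactive_ratio = gb ? int_sqrt(10 * gb) : 1;
	deactivate = inactive * inactive_ratio < active;

	/* Cache trim mode: scale by priority, then a zero/nonzero check. */
	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
		sc->cache_trim_mode = 1;

An LRU size that is a couple of seconds stale can only flip either
decision when it is sitting right at one of these coarse boundaries.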
> - Indeterministic behavior like this one is very difficult to debug if
> it causes problems. The missing updates in the last 2s (or whatever
> period) could be of any magnitude. We may be ignoring GBs of
> free/allocated memory. What's worse is, if it causes any problems,
> tracing it back to this flush will be extremely difficult.

This is indeed an issue, but it is common to heuristics in general:
they work most of the time and fail for a small set of cases. Anyway,
I am not arguing for removing the sync flush in all cases. Rather, I
am arguing that for this specific case we don't need to be precise,
as I explained above.

> 
> What can we do?
> 
> - Try to make more fundamental improvements to the flushing code (for
> memcgs or cgroups in general). The per-memcg flushing thresholding is
> an example of this. For example, if flushing is taking too long
> because we are flushing all subsystems, it may make sense to have
> separate rstat trees for separate subsystems.

Yes, separate flushing for each subsystem makes sense and can be done
orthogonally.

> 
> One other thing we can try is to add a mutex in the memcg flushing
> path. I initially had this in my subtree flushing series [1], but I
> dropped it as we thought it was not very useful. Currently in
> mem_cgroup_flush_stats(), we check if there are enough pending updates
> to flush, then we call cgroup_flush_stats() and spin on the lock. It
> is possible that while we spin, those pending updates we observed have
> been flushed. If we add back the mutex like in [1], then once we
> acquire the mutex we check again to make sure there are still enough
> stats to flush.
> 
> [1] https://lore.kernel.org/all/20231010032117.1577496-6-yosryahmed@google.com/

My main beef with the global mutex is the possible priority inversion.
Unless you agree to add a try_lock() and skip flushing, i.e. no one
ever sleeps on the mutex, this is a no-go.
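
Something along the lines of the following hypothetical sketch would
work (names are made up for illustration; memcg_should_flush_stats()
stands in for whatever update-threshold check is in place):

	static DEFINE_MUTEX(memcg_flush_mutex);

	static void mem_cgroup_try_flush_stats(struct mem_cgroup *memcg)
	{
		/* Not enough pending updates to justify a flush. */
		if (!memcg_should_flush_stats(memcg))
			return;
		/* Someone else is flushing: skip, never sleep on the mutex. */
		if (!mutex_trylock(&memcg_flush_mutex))
			return;
		/* Re-check: the concurrent flusher may have covered us. */
		if (memcg_should_flush_stats(memcg))
			cgroup_rstat_flush(memcg->css.cgroup);
		mutex_unlock(&memcg_flush_mutex);
	}

That keeps the re-check you want without letting reclaimers sleep
behind a lower-priority flusher.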
> 
> - Try to avoid the need for flushing in this path. I am not sure what
> approach MGLRU uses to avoid the flush, but if we can do something
> similar for classic LRUs that would be preferable. I am guessing MGLRU
> may be maintaining its own stats outside of the rstat framework.

MGLRU simply does not use these heuristics (Yu Zhao, please correct me
if I am wrong).

> 
> - Try to figure out if one (or a few) update paths are regressing all
> flushers. If one specific stat or stats update path is causing most of
> the updates, we can try to fix that instead. Especially if it's a
> counter that is continuously being increased and decreased (so the net
> change is not as high as we think).

This is actually a good point. I remember Jesper mentioning that
MEMCG_KMEM might be the one with the most updates. I can try to collect
from the Meta fleet what the cause of most of the updates is.

> 
> At the end of the day, all of the above may not work, and we may have
> to live with just using the ratelimited approach. But I *really* hope
> we could actually go the other way. Fix things on a more fundamental
> level and eventually drop the ratelimited variants completely.
> 
> Just my 2c. Sorry for the long email :)

Please note that this is not some user API which cannot be changed
later. We can change and dissect it however we want. My only point is
not to wait for the perfect solution, but to have some intermediate,
good enough solution in the meantime.

Thanks for the review.