From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2CD11C27C79 for ; Mon, 17 Jun 2024 18:01:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5297F6B0261; Mon, 17 Jun 2024 14:01:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 43D406B0266; Mon, 17 Jun 2024 14:01:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 244046B0269; Mon, 17 Jun 2024 14:01:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id F38AB6B0261 for ; Mon, 17 Jun 2024 14:01:45 -0400 (EDT) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id AB286C19A6 for ; Mon, 17 Jun 2024 18:01:45 +0000 (UTC) X-FDA: 82241148570.03.4A12723 Received: from out-180.mta1.migadu.com (out-180.mta1.migadu.com [95.215.58.180]) by imf15.hostedemail.com (Postfix) with ESMTP id 09FD0A0028 for ; Mon, 17 Jun 2024 18:01:42 +0000 (UTC) Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b="kUbRO/m7"; spf=pass (imf15.hostedemail.com: domain of shakeel.butt@linux.dev designates 95.215.58.180 as permitted sender) smtp.mailfrom=shakeel.butt@linux.dev; dmarc=pass (policy=none) header.from=linux.dev ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1718647298; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=c9q5pk9eheQqKt91dvfdalJZi0Mslh6TzikYYicPd/c=; b=Q7ju/F56Bc0Kk61KXxs/aaKNcQFfMHv/JW6NjS5WLM0/rG66IMXk4On+zIEjNE5huZ0hyQ hOzT0EKfE8Y6U9TZs1v5kp/A+bJ3WmU/LgmqmorF5bYQrdgkFEzpLZl4HSm1xUFS1YTyiA T/aLDCFb2O7I+QlYaXqfXuoISvmMzNY= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1718647298; a=rsa-sha256; cv=none; b=529m8T4MiC+p9T7mUJbPPRo2XcVk2QcqYGC3BDn0dqZwPLCLc5jV+TEDa207pA/ivAEi13 9jtsEmfSzAdodaooutdOPa02/zLl3URHkjLhljUcccF2piTeCpkb6RhMgB7S7j8gmlYnKb skpzcbSaVuIeXrGZDJS/zP2D2eyXrRU= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=linux.dev header.s=key1 header.b="kUbRO/m7"; spf=pass (imf15.hostedemail.com: domain of shakeel.butt@linux.dev designates 95.215.58.180 as permitted sender) smtp.mailfrom=shakeel.butt@linux.dev; dmarc=pass (policy=none) header.from=linux.dev X-Envelope-To: hawk@kernel.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1718647300; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=c9q5pk9eheQqKt91dvfdalJZi0Mslh6TzikYYicPd/c=; b=kUbRO/m7Cu0gDkmawgzPLhotwcpg4Q4zhwOTsLhO5hpmTvKBJWa5Btfwj5Jc9YRzj3URAk h5SsTEzrKHXlpAufDQwPKsRGC9R3pGxGzqiodXhC+p9CLu36vdyUcaHjjCQgLknB3h6jo/ R7nJwLqoOrVyUspCsIrp4Oa+PDxjwPY= X-Envelope-To: yosryahmed@google.com X-Envelope-To: akpm@linux-foundation.org X-Envelope-To: hannes@cmpxchg.org X-Envelope-To: mhocko@suse.com X-Envelope-To: roman.gushchin@linux.dev X-Envelope-To: yuzhao@google.com X-Envelope-To: songmuchun@bytedance.com X-Envelope-To: kernel-team@meta.com X-Envelope-To: linux-mm@kvack.org X-Envelope-To: linux-kernel@vger.kernel.org X-Envelope-To: kernel-team@cloudflare.com Date: Mon, 17 Jun 2024 11:01:36 -0700 X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Shakeel Butt To: Jesper Dangaard Brouer Cc: Yosry Ahmed , Andrew Morton , Johannes Weiner , Michal Hocko , Roman Gushchin , Yu Zhao , Muchun Song , Facebook Kernel Team , linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim Message-ID: References: <20240615081257.3945587-1-shakeel.butt@linux.dev> <0ec3c33c-d9ff-41a5-be94-0142f103b815@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <0ec3c33c-d9ff-41a5-be94-0142f103b815@kernel.org> X-Migadu-Flow: FLOW_OUT X-Rspamd-Queue-Id: 09FD0A0028 X-Stat-Signature: q7tad6c7okwsnim1sgeqxwfyskswm49s X-Rspamd-Server: rspam09 X-Rspam-User: X-HE-Tag: 1718647302-31864 X-HE-Meta: U2FsdGVkX1+0mai4S4MVtgF6T9Ci3N59WqGcLQs9LoSw8KXGBJT+68kZZ2MFIjoGuMvXDmJ52iF3u5v3SMiNZsLB894UPZ5GKhUV14GaGDqLnRVLGZG10s39QsshLCoklYZgQ6owbEApgjUIpoDH3LzOIFcALq+EHblQ2dkwTbok5mICKei5UgyDaqP1E4toQcFSfYRutqSF6GlPZAHCWiPjPvtw/7s46fSI0NKQaHvmR6Fu9xM2a0e5dlOemS8nF8vFe2gbp2nUZ87RbJIwB25Nf6Nddi9q95HsiJSQXqgDXmiB6ihfqQ1CGreSM9Fp2v8y0lDR0ViPPgUJpf47p5Pe7BsBNUP7z2fbxOi5VmJ5MjnpLbTSYOI6sCymU3iMVTs3INFgW8v/NKuVrg7Dz1YblB7l1ILGMxd8t2sr2RUtcOAZhMcgGueLEV6hp+4X/tCA4RgnICy8mS2/2Ry2U7xY0GiPe2v3wc9cCT8EBACPU6y0D7WgJ1zAOOxBaSpxlfRi+wbcwvwUVRFSympq50JJSMhbbH9Bd1cckoxIJEt7NqozMHmQMRPIRDMXO+yQSdwPIIPRPOI0x63TlSKYkTOGvx50+8VGi5zl9dBP++rcsVSlh2sarx0HEDVX7VvWFDAbE4vIdvW0O5zV4oMGjOr51ghAiV9SqfK88zhzFjH2rNVmuj0kcPqXdDpSqLUyH8h2yrUSWGCUCi8s2y9KA2GrUtI6fFVUqM55puPAaguGAWf0iE5gTatk14LtcK/K5fXBP6TCBY7IFjrAxhjiXxc+2tmWh58In0LrmHulKbfi658iItn5iohKnqFM1wRngM6aZkF7o84cxoHTbRw8HKE6D5GSoK4AMi4gxPEynRFcigKzioRtk8q1Mqchy9FSPGiCO3Vl5QA1Q5qYTRkeZBNIU55mGbW6cDQj/wW0L9Qkie/dOvLOoKb5lgWNzSuxjnlSQVK40zksmC3Tiwb ynxtT6EW 9kDbgbAKFj41Lt12fE1T3AILiUkONaxjdcleyh7/PJxWwYpd2IuH/zk7U0GYgpR8eOBCto6DDuiZ4N6ZvicQBwBXhPi7QDj/EjJPBBUNxcLXIjUJWbX8SBCtJnhlCF3BJ4OhUn1ns0kXCXcqDPdqh+h2Nmg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Jun 17, 2024 at 05:31:21PM GMT, Jesper Dangaard Brouer wrote: > > > On 16/06/2024 02.28, Yosry Ahmed wrote: > > On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt wrote: > > > > > > The Meta prod is seeing large amount of stalls in memcg stats flush > > > from the memcg reclaim code path. At the moment, this specific callsite > > > is doing a synchronous memcg stats flush. The rstat flush is an > > > expensive and time consuming operation, so concurrent relaimers will > > > busywait on the lock potentially for a long time. Actually this issue is > > > not unique to Meta and has been observed by Cloudflare [1] as well. For > > > the Cloudflare case, the stalls were due to contention between kswapd > > > threads running on their 8 numa node machines which does not make sense > > > as rstat flush is global and flush from one kswapd thread should be > > > sufficient for all. Simply replace the synchronous flush with the > > > ratelimited one. > > > > > Like Yosry, I don't agree that simply using ratelimited flush here is > the right solution, at-least other options need to be investigated first. I added more detail in my reply to Yosry on why using ratelimited flush for this specific case is fine. [...] > > > > I think you already know my opinion about this one :) I don't like it > > at all, and I will explain why below. I know it may be a necessary > > evil, but I would like us to make sure there is no other option before > > going forward with this. > > > I'm signing up to solving this somehow, as this is a real prod issue. > > An easy way to solve the kswapd issue, would be to reintroduce > "stats_flush_ongoing" concept, that was reverted in 7d7ef0a4686a ("mm: > memcg: restore subtree stats flushing") (Author: Yosry Ahmed), and > introduced in 3cd9992b9302 ("memcg: replace stats_flush_lock with an > atomic") (Author: Yosry Ahmed). > The skipping flush for "stats_flush_ongoing" was there from the start. > The concept is: If there is an ongoing rstat flush, this time limited to > the root cgroup, then don't perform the flush. We can only do this for > the root cgroup tree, as flushing can be done for subtrees, but kswapd > is always for root tree, so it is good enough to solve the kswapd > thundering herd problem. We might want to generalize this beyond memcg. > No objection from me for this skipping root memcg flush idea. > [...] > > > - With the added thresholding code, a flush is only done if there is a > > significant number of pending updates in the relevant subtree. > > Choosing the ratelimited approach is intentionally ignoring a > > significant change in stats (although arguably it could be irrelevant > > stats). > > > > My production observations are that the thresholding code isn't limiting > the flushing in practice. > Here we need more production data. I remember you mentioned MEMCG_KMEM being used for most of the updates. Is it possible to get top 5 (or 10) most updated stats for your production environment? > > > - Reclaim code is an iterative process, so not updating the stats on > > every retry is very counterintuitive. We are retrying reclaim using > > the same stats and heuristics used by a previous iteration, > > essentially dismissing the effects of those previous iterations. > > > > - Indeterministic behavior like this one is very difficult to debug if > > it causes problems. The missing updates in the last 2s (or whatever > > period) could be of any magnitude. We may be ignoring GBs of > > free/allocated memory. What's worse is, if it causes any problems, > > tracing it back to this flush will be extremely difficult. > > > > The 2 sec seems like a long period for me. > > > What can we do? > > > > - Try to make more fundamental improvements to the flushing code (for > > memcgs or cgroups in general). The per-memcg flushing thresholding is > > an example of this. For example, if flushing is taking too long > > because we are flushing all subsystems, it may make sense to have > > separate rstat trees for separate subsystems. > > > > One other thing we can try is add a mutex in the memcg flushing path. > > I had initially had this in my subtree flushing series [1], but I > > dropped it as we thought it's not very useful. > > I'm running an experimental kernel with rstat lock converted to mutex on > a number of production servers, and we have not observed any regressions. > The kswapd thundering herd problem also happen on these machines, but as > these are sleep-able background threads, it is fine to sleep on the mutex. > Sorry but a global mutex which can be taken by userspace applications and is needed by node controller (to read stats) is a no from me. On a multi-tenant systems, global locks causing priority inversion is a real issue. > [...] > > My pipe dream is that kernel can avoiding the cost of maintain the > cgroup threshold stats for flushing, and instead rely on a dynamic time > based threshold (in ms area) that have no fast-path overhead :-P > Please do expand on what you mean by dynamic time based threshold.