Date: Tue, 18 Jun 2024 17:53:02 +0200
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim
From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Shakeel Butt
Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Michal Hocko,
 Roman Gushchin, Yu Zhao, Muchun Song, Facebook Kernel Team,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team
References: <20240615081257.3945587-1-shakeel.butt@linux.dev>
 <0ec3c33c-d9ff-41a5-be94-0142f103b815@kernel.org>

On 17/06/2024 20.01, Shakeel Butt wrote:
> On Mon, Jun 17, 2024 at 05:31:21PM GMT, Jesper Dangaard Brouer wrote:
>>
>>
>> On 16/06/2024 02.28, Yosry Ahmed wrote:
>>> On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt wrote:
>>>>
>>>> Meta production is seeing a large amount of stalls in memcg stats
>>>> flush from the memcg reclaim code path. At the moment, this specific
>>>> callsite is doing a synchronous memcg stats flush. The rstat flush is
>>>> an expensive and time-consuming operation, so concurrent reclaimers
>>>> will busywait on the lock, potentially for a long time. This issue is
>>>> not unique to Meta and has been observed by Cloudflare [1] as well.
>>>> For the Cloudflare case, the stalls were due to contention between
>>>> kswapd threads running on their 8 NUMA node machines, which does not
>>>> make sense, as the rstat flush is global and a flush from one kswapd
>>>> thread should be sufficient for all. Simply replace the synchronous
>>>> flush with the ratelimited one.
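
(For readers without the patch in front of them: the change under
discussion boils down to a one-line swap in the reclaim path. The sketch
below is from memory, so the exact callsite and hunk may differ from the
posted patch.)

  --- a/mm/vmscan.c
  +++ b/mm/vmscan.c
  @@ prepare_scan_control() @@   (sketch, not the exact hunk)
  -	mem_cgroup_flush_stats(sc->target_mem_cgroup);
  +	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
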
>>
>> Like Yosry, I don't agree that simply using the ratelimited flush here
>> is the right solution; at least other options need to be investigated
>> first.
>
> I added more detail in my reply to Yosry on why using the ratelimited
> flush for this specific case is fine.
>
> [...]
>>>
>>> I think you already know my opinion about this one :) I don't like it
>>> at all, and I will explain why below. I know it may be a necessary
>>> evil, but I would like us to make sure there is no other option before
>>> going forward with this.
>>>
>> I'm signing up to solve this somehow, as this is a real prod issue.
>>
>> An easy way to solve the kswapd issue would be to reintroduce the
>> "stats_flush_ongoing" concept, which was reverted in 7d7ef0a4686a ("mm:
>> memcg: restore subtree stats flushing") (Author: Yosry Ahmed), and
>> introduced in 3cd9992b9302 ("memcg: replace stats_flush_lock with an
>> atomic") (Author: Yosry Ahmed).
>>
>
> The skipping of the flush for "stats_flush_ongoing" was there from the
> start.
>
>> The concept is: if there is an ongoing rstat flush, this time limited
>> to the root cgroup, then don't perform the flush. We can only do this
>> for the root cgroup tree, as flushing can be done for subtrees, but
>> kswapd always flushes the root tree, so it is good enough to solve the
>> kswapd thundering herd problem. We might want to generalize this
>> beyond memcg.
>>
>
> No objection from me for this skipping root memcg flush idea.
>

Okay, looking into coding this up then (a rough sketch of the idea is
further below).

>>
> [...]
>>
>>> - With the added thresholding code, a flush is only done if there is a
>>> significant number of pending updates in the relevant subtree.
>>> Choosing the ratelimited approach is intentionally ignoring a
>>> significant change in stats (although arguably it could be irrelevant
>>> stats).
>>>
>>
>> My production observations are that the thresholding code isn't
>> limiting the flushing in practice.
>>
>
> Here we need more production data. I remember you mentioned MEMCG_KMEM
> being used for most of the updates. Is it possible to get the top 5 (or
> 10) most updated stats for your production environment?
>

Do you have a method for obtaining these stats? Last time I used eBPF +
bpftrace to extract them. These updates occur at such a high rate that
the tracing itself caused production overload situations that SRE
noticed, so I have to be careful when producing these stats for you.
Also, could you be more specific about which code lines you want stats
for?

>>
>>> - Reclaim code is an iterative process, so not updating the stats on
>>> every retry is very counterintuitive. We are retrying reclaim using
>>> the same stats and heuristics used by a previous iteration,
>>> essentially dismissing the effects of those previous iterations.
>>>
>>> - Indeterministic behavior like this one is very difficult to debug if
>>> it causes problems. The missing updates in the last 2s (or whatever
>>> period) could be of any magnitude. We may be ignoring GBs of
>>> free/allocated memory. What's worse, if it causes any problems,
>>> tracing them back to this flush will be extremely difficult.
>>>
>>
>> The 2 sec seems like a long period to me.
>>
>>> What can we do?
>>>
>>> - Try to make more fundamental improvements to the flushing code (for
>>> memcgs or cgroups in general). The per-memcg flushing thresholding is
>>> an example of this. For example, if flushing is taking too long
>>> because we are flushing all subsystems, it may make sense to have
>>> separate rstat trees for separate subsystems.
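
Sketch of the skip-if-root-flush-ongoing idea mentioned above (untested;
the flag and function name are made up for illustration, while
mem_cgroup_is_root() and cgroup_rstat_flush() are existing kernel APIs,
and the real memcg flush helpers obviously do more, e.g. the update
thresholding):

static atomic_t root_flush_ongoing;	/* hypothetical flag */

static void memcg_root_flush_sketch(struct mem_cgroup *memcg)
{
	if (mem_cgroup_is_root(memcg)) {
		/* Another flusher is already flushing the root tree;
		 * its result is good enough, so skip instead of waiting.
		 */
		if (atomic_xchg(&root_flush_ongoing, 1))
			return;
		cgroup_rstat_flush(memcg->css.cgroup);
		atomic_set(&root_flush_ongoing, 0);
	} else {
		/* Subtree flushes keep the existing behavior. */
		cgroup_rstat_flush(memcg->css.cgroup);
	}
}
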
>>>
>>> One other thing we can try is to add a mutex in the memcg flushing
>>> path. I initially had this in my subtree flushing series [1], but I
>>> dropped it as we thought it's not very useful.
>>
>> I'm running an experimental kernel with the rstat lock converted to a
>> mutex on a number of production servers, and we have not observed any
>> regressions. The kswapd thundering herd problem also happens on these
>> machines, but as these are sleepable background threads, it is fine to
>> sleep on the mutex.
>>
>
> Sorry, but a global mutex which can be taken by userspace applications
> and is needed by the node controller (to read stats) is a no from me.
> On multi-tenant systems, global locks causing priority inversion are a
> real issue.
>

The situation we have *today*, with a global IRQ-disabling spin_lock, is
already a priority-inversion situation by design. Userspace applications
(reading the stat file) and kswapd background processes currently get
higher priority than both hardware and software interrupts. This is
causing actual production issues, which is why I'm working on this.

I do understand that changing this to a global mutex creates the
theoretical *possibility* of priority inversion between processes with
different configured priorities. IMHO this is still better than always
taking the "big" bottom-half lock [LWN].

I still want to lower the potential priority-inversion issue with the
mutex lock, by (1) lowering the pressure on the lock (e.g. exit if a
flush is ongoing on root, and e.g. add a time limit on how often a flush
can run on sub-trees), and (2) collecting production metrics for lock
contention and hold time (with appropriate alerts).

[LWN] https://lwn.net/SubscriberLink/978189/5f50cab8478fac45/

>>
> [...]
>>
>> My pipe dream is that the kernel could avoid the cost of maintaining
>> the cgroup threshold stats for flushing, and instead rely on a dynamic,
>> time-based threshold (in the ms range) that has no fast-path overhead
>> :-P
>>
>
> Please do expand on what you mean by dynamic time based threshold.
>

I proposed a fixed 50 ms flush rate limit in [2], but I don't want this
to be a static value, as the right value will vary across configs and
hardware. I propose making it dynamic by measuring the time it takes to
flush the cgroup root (a rough sketch follows at the end of this mail).
This makes sense in a queueing theory context, because the root flush
corresponds to the (longest) service time, and theory says "a queue is
formed when customers arrive faster than they can get served". Thus,
this should lower the pressure/contention on the lock, while still
allowing frequent flushing.

Hope it makes sense.
--Jesper

[2] https://lore.kernel.org/all/171328990014.3930751.10674097155895405137.stgit@firesoul/
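
For completeness, a rough sketch of that dynamic, time-based threshold
(untested illustration; ktime_get_ns() and cgroup_rstat_flush() are real
kernel APIs, everything else is a made-up name): measure how long a root
flush takes and use that service time as the minimum interval before the
next flush is allowed.

static u64 root_flush_cost_ns;	/* last measured root flush service time */
static u64 last_root_flush_ns;	/* when the last root flush finished */

static bool root_flush_is_due(void)
{
	/* Queueing-theory angle: don't let "customers" (flushers)
	 * arrive faster than they can be served.
	 */
	return ktime_get_ns() - last_root_flush_ns > root_flush_cost_ns;
}

static void root_flush_timed(struct cgroup *root)
{
	u64 start = ktime_get_ns();

	cgroup_rstat_flush(root);

	last_root_flush_ns = ktime_get_ns();
	root_flush_cost_ns = last_root_flush_ns - start;
}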