From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 24 Jun 2024 05:57:51 -0700
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
 Jesper Dangaard Brouer, Yu Zhao, Muchun Song, Facebook Kernel Team,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240615081257.3945587-1-shakeel.butt@linux.dev>

> > and I will explain why below. I know it may be a necessary
> > evil, but I would like us to make sure there is no other option
> > before going forward with this.
>
> Instead of necessary evil, I would call it a pragmatic approach, i.e.
> resolve the ongoing pain with a good enough solution and work on a
> long term solution later.

It seems like there are a few ideas for solutions that may address the
longer-term concerns. Let's make sure we try those out first before we
fall back to the short-term mitigation.

[..]

> > - Reclaim code is an iterative process, so not updating the stats
> > on every retry is very counterintuitive. We are retrying reclaim
> > using the same stats and heuristics used by a previous iteration,
> > essentially dismissing the effects of those previous iterations.
>
> I think I explained in the commit message why we don't need precise
> metrics for this specific case, but let me reiterate.
>
> The stats are needed for two specific heuristics in this case:
>
> 1. Deactivate LRUs
> 2. Cache trim mode
>
> The deactivate LRUs heuristic maintains a desirable inactive:active
> ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
> and the hierarchical LRU sizes. WORKINGSET_ACTIVATE* is needed to
> check whether there has been a refault since the last snapshot, and
> the LRU sizes are needed to compute the desirable ratio between the
> inactive and active LRUs. See the table below for how the desirable
> ratio is calculated:
>
> /* total     target    max
>  * memory    ratio     inactive
>  * -------------------------------------
>  *   10MB       1         5MB
>  *  100MB       1        50MB
>  *    1GB       3       250MB
>  *   10GB      10       0.9GB
>  *  100GB      31         3GB
>  *    1TB     101        10GB
>  *   10TB     320        32GB
>  */
>
> The desirable ratio only changes at the boundaries of 1 GiB, 10 GiB,
> 100 GiB, 1 TiB and 10 TiB, so there is no need for precise and
> accurate LRU size information to calculate it. In addition, if
> deactivation is skipped for some LRU, the kernel will force
> deactivation under severe memory pressure.

Thanks for explaining this in such detail. It does make me feel
better, but keep in mind that the above heuristics may change in the
future and become more sensitive to stale stats, and very likely no
one will remember that we previously decided stale stats were fine.
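For anyone following along, the table above corresponds to roughly
this calculation (paraphrasing inactive_is_low() in mm/vmscan.c; the
exact code varies by kernel version):

static bool inactive_is_low(struct lruvec *lruvec, enum lru_list inactive_lru)
{
	enum lru_list active_lru = inactive_lru + LRU_ACTIVE;
	unsigned long inactive, active;
	unsigned long inactive_ratio;
	unsigned long gb;

	inactive = lruvec_page_state(lruvec, NR_LRU_BASE + inactive_lru);
	active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);

	/* The target ratio is a step function of the total LRU size in GiB. */
	gb = (inactive + active) >> (30 - PAGE_SHIFT);
	if (gb)
		inactive_ratio = int_sqrt(10 * gb); /* 1GB -> 3, 10GB -> 10, 100GB -> 31 */
	else
		inactive_ratio = 1;

	return inactive * inactive_ratio < active;
}

Since int_sqrt(10 * gb) only moves when the total LRU size crosses an
order-of-magnitude boundary, the result is indeed fairly insensitive
to moderate staleness in the LRU sizes, as you say.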
> For the cache trim mode, the inactive file LRU size is read and the
> kernel scales it down based on the reclaim iteration
> (file >> sc->priority), and only checks whether it is zero or not.
> Again, precise information is not needed.

It sounds like it is possible that we enter cache trim mode when we
shouldn't if the stats are stale. Couldn't this lead to over-reclaiming
file memory?
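For concreteness, the check being described is roughly the following
(paraphrasing mm/vmscan.c; the function this lives in has moved around
across kernel versions, so treat it as a sketch rather than an exact
quote):

	/*
	 * Enter cache trim mode if there is plenty of inactive file
	 * cache that is not being refaulted. "Plenty" is the inactive
	 * file size scaled down by the reclaim priority; only whether
	 * the scaled value is zero matters.
	 */
	file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
	if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
		sc->cache_trim_mode = 1;
	else
		sc->cache_trim_mode = 0;

With stale stats, file can overstate the real inactive file size for
up to a flush period, so cache trim mode could stay on after the
inactive file LRU has actually been depleted, biasing reclaim toward
file pages.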
> > - Nondeterministic behavior like this is very difficult to debug
> > if it causes problems. The missing updates in the last 2s (or
> > whatever the period is) could be of any magnitude. We may be
> > ignoring GBs of free/allocated memory. What's worse, if it causes
> > any problems, tracing them back to this flush will be extremely
> > difficult.
>
> This is indeed an issue, but that is common with heuristics in
> general. They work most of the time and fail for a small set of
> cases.
>
> Anyway, I am not arguing for removing the sync flush in all cases.
> Rather, I am arguing that for this specific case we don't need to be
> precise, as I have explained above.

> > What can we do?
> >
> > - Try to make more fundamental improvements to the flushing code
> > (for memcgs or cgroups in general). The per-memcg flushing
> > thresholding is an example of this. For example, if flushing is
> > taking too long because we are flushing all subsystems, it may
> > make sense to have separate rstat trees for separate subsystems.
>
> Yes, separate flushing for each subsystem makes sense and can be
> done orthogonally.

> > One other thing we can try is to add a mutex in the memcg flushing
> > path. I initially had this in my subtree flushing series [1], but
> > I dropped it as we thought it was not very useful. Currently in
> > mem_cgroup_flush_stats(), we check if there are enough pending
> > updates to flush, then we call cgroup_flush_stats() and spin on
> > the lock. It is possible that while we spin, the pending updates
> > we observed have already been flushed. If we add back the mutex as
> > in [1], then once we acquire the mutex we check again to make sure
> > there are still enough stats to flush.
> >
> > [1] https://lore.kernel.org/all/20231010032117.1577496-6-yosryahmed@google.com/
>
> My main beef with the global mutex is the possible priority
> inversion. Unless you agree to add try_lock() and skip flushing,
> i.e. no one sleeps on the mutex, this is a no-go.

Jesper is working on ways to mitigate the possible priority inversion,
AFAICT. Let's see what comes out of that (I commented on Jesper's
patch). I have put a sketch of the double-check idea from [1] at the
end of this email for reference.

> > - Try to avoid the need for flushing in this path. I am not sure
> > what approach MGLRU uses to avoid the flush, but if we can do
> > something similar for the classic LRUs that would be preferable. I
> > am guessing MGLRU may be maintaining its own stats outside of the
> > rstat framework.
>
> MGLRU simply doesn't use these heuristics (Yu Zhao, please correct
> me if I am wrong).

> > - Try to figure out if one (or a few) update paths are regressing
> > all flushers. If one specific stat or stat update path is causing
> > most of the updates, we can try to fix that instead. Especially if
> > it's a counter that is continuously being increased and decreased
> > (so the net change is not as high as we think).
>
> This is actually a good point. I remember Jesper mentioning that
> MEMCG_KMEM might be the one with the most updates. I can try to
> collect data from the Meta fleet on what causes most of the updates.

Let's also wait and see what comes out of this. It would be
interesting if we could fix this on the update side instead.

> > At the end of the day, all of the above may not work, and we may
> > have to live with just using the ratelimited approach. But I
> > *really* hope we can actually go the other way: fix things on a
> > more fundamental level and eventually drop the ratelimited
> > variants completely.
> >
> > Just my 2c. Sorry for the long email :)
>
> Please note that this is not some user API that cannot be changed
> later. We can change and dissect it however we want. My only point
> is not to wait for the perfect solution, and to have some
> intermediate, good enough solution in the meantime.

I agree that we shouldn't wait for a perfect solution, but it also
seems like there are a few easy-ish solutions that we can explore
first (Jesper's patch, investigating the update paths, etc.). If none
of those pan out, we can fall back to the ratelimited flush, ideally
with a plan on next steps for a longer-term solution.
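P.S. Here is a minimal sketch of the double-check mutex idea from [1],
for reference. This is not the actual patch: the helper names
memcg_should_flush_stats() and do_stats_flush() are illustrative
stand-ins for whatever the update-threshold check and the underlying
rstat flush are called at the time.

static DEFINE_MUTEX(memcg_flush_mutex);

static void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
	/* Cheap check first: are there enough pending updates to bother? */
	if (!memcg_should_flush_stats(memcg))
		return;

	mutex_lock(&memcg_flush_mutex);
	/*
	 * Recheck under the mutex: whoever held it may have already
	 * flushed the updates we observed, making our flush redundant.
	 */
	if (memcg_should_flush_stats(memcg))
		do_stats_flush(memcg);
	mutex_unlock(&memcg_flush_mutex);
}

Shakeel's try_lock() variant would replace mutex_lock() with
mutex_trylock() and return early when the mutex is contended, so no
caller ever sleeps (avoiding the priority inversion), at the cost of
proceeding with slightly stale stats whenever someone else is flushing.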