From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH] mm/mglru: fix cgroup OOM during MGLRU state switching
From: Bingfang Guo <bfguo@icloud.com>
Date: Tue, 3 Mar 2026 16:27:40 +0800
To: Yafang Shao
Cc: Axel Rasmussen, Barry Song <21cnbao@gmail.com>, Yuanchu Xie,
 Kairui Song, BINGFANG GUO, lenohou@gmail.com, akpm@linux-foundation.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, weixugc@google.com,
 wjl.linux@gmail.com, yuzhao@google.com
Message-Id: <71281163-F081-456F-AFB0-37DC0470850A@icloud.com>
References: <20260228161008.707-1-lenohou@gmail.com> <20260228212837.59661-1-21cnbao@gmail.com>

Resend due to missing In-Reply-To header.

Hi all. Thanks for inviting me to the discussion. I'm glad to join you
and share my ideas and findings with you.
On Mon, Mar 2, 2026 at 4:25 PM Yafang Shao wrote:
>
> On Mon, Mar 2, 2026 at 4:00 PM Kairui Song wrote:
> >
> > On Mon, Mar 2, 2026 at 3:43 PM Yafang Shao wrote:
> > >
> > > On Mon, Mar 2, 2026 at 2:58 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > I assume latency is not a concern for a very rare
> > > > MGLRU on/off case. Do you require the switch to happen
> > > > with zero latency? My main concern is the correctness of the code.
> > > >
> > > > Now the proposed patch is:
> > > >
> > > > +	bool lrugen_enabled = smp_load_acquire(&lruvec->lrugen.enabled);
> > > > +	bool lru_draining = smp_load_acquire(&lruvec->lrugen.draining);
> > > >
> > > > Then choose MGLRU or active/inactive LRU based on
> > > > those values.
> > > >
> > > > However, nothing prevents those values from changing
> > > > after they are read. Even within the shrink path,
> > > > they can still change.
> >
> > Hi all,
> >
> > > If these values are changed during reclaim, the currently running
> > > reclaimer will continue to operate with the old settings, while any
> > > new reclaimer processes will adopt the new values. This approach
> > > should prevent any immediate issues, but the primary risk of this
> > > lockless method is the potential for a user to rapidly toggle the
> > > MGLRU feature, particularly during an intermediate state.
> > >
> > > > So I think we need an rwsem or something similar here:
> > > > a read lock for shrink and a write lock for on/off. The
> > > > write lock should happen very rarely.
> > >
> > > We can introduce a lock-based mechanism in v2.
> >
> > I hope we don't need a lock here. Currently there is only a static
> > key, and this patch is already adding more branches; a lock would make
> > things more complex, and the shrinking path is quite performance
> > sensitive.
> >
> > > > To be honest, the on/off toggle is quite odd. If possible,
> > > > I'd prefer not to switch between MGLRU and the active/inactive
> > > > LRU dynamically. Once it's set up during system boot, it
> > > > should remain unchanged.
> > >
> > > While that is well-suited for Android environments, it is not viable
> > > for Kubernetes production servers, where rebooting is highly
> > > disruptive. This limitation is precisely why we need to introduce
> > > dynamic toggles.
> >
> > I agree with Barry: the switch isn't supposed to be a knob to be
> > turned on/off frequently. And I think in the long term we should just
> > identify the workloads where MGLRU doesn't work well, and fix MGLRU.
>
> The challenge we're currently facing is that we don't yet know which
> workloads would benefit from it ;)
> We do want to enable MGLRU on our production servers, but first we
> need to address the risk of OOM during the switch; that's exactly why
> we're proposing this patch.

Yes. I believe our long-term target is to integrate the two LRU
implementations. But for now, it's important to keep this dynamic toggling
feature and make it robust and work well. So if users are willing to try
the new LRU algorithm, they are free to enable it after the system boots
for testing, and to disable it if they run into trouble, without worrying
about OOM and other problems. That way, we can gain more users and
potentially expose, and fix, more problems related to MGLRU.

On Mon, Mar 3, 2026 at 1:34 AM Barry Song <21cnbao@gmail.com> wrote:

> 2. Ensure that shrinking and switching do not occur
> simultaneously by using something like an rwsem:
> shrinking can proceed in parallel under the read
> lock, while the (rare) switching path takes the
> write lock.

In my opinion, completely banning others from reclaiming demands more
than is needed. We have many huge servers with services running in
enormous memcgs.
In such cases, waiting for the draining to complete may take so long (tens
of seconds, for example) that the service gets many timeout failures. But
there's a high chance that reclaimers can still reclaim enough even if the
draining is not complete. So maybe we can have concurrent reclaiming and
state-switch draining?

Regarding the discussion, I would like to propose a slightly different
approach that is already in use in production. The proposal mainly focuses
on two practical considerations:

1. State switching is a rare operation, so we should not penalize the
   normal reclaim path or introduce more locks for this rare case.
2. We should avoid introducing long latency spikes during production
   state transitions (e.g., switching on live machines).

The downstream solution is very similar to a combination of all your
proposals, but it makes some radical attempts to avoid sleeping and
therefore reduce lag from waiting as much as possible. At the same time,
we keep a last-chance wait to prevent early OOMs from happening.

We use a static key to indicate that a state change is in progress. All
operations are encapsulated in that slow path, so there is no extra
overhead for the normal path. If we are in the draining process, we first
try reclaiming from wherever lrugen->enabled says we are, as long as we
still have plenty of retries left. With no retries left, we simply wait
until we pass the race window.

--
Thanks
Bingfang
--

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 614ccf39fe3f..d7ff7a6ed088 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2652,6 +2652,43 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 
 #ifdef CONFIG_LRU_GEN
 
+DEFINE_STATIC_KEY_FALSE(lru_gen_draining);
+
+static inline bool lru_gen_is_draining(void)
+{
+	return static_branch_unlikely(&lru_gen_draining);
+}
+
+/*
+ * Lazily wait for the draining thread to finish if it's running.
+ *
+ * Return: whether we'd like to reclaim from multi-gen LRU.
+ */
+static inline bool lru_gen_draining_wait(struct lruvec *lruvec, struct scan_control *sc)
+{
+	bool global_enabled = lru_gen_enabled();
+
+	/* Try reclaiming from the current LRU first */
+	if (sc->priority > DEF_PRIORITY / 2)
+		return READ_ONCE(lruvec->lrugen.enabled);
+
+	/* Oops, try from the other side... */
+	if (sc->priority > 1)
+		return global_enabled;
+
+	/*
+	 * If we see lrugen.enabled is consistent here, when we get the lru
+	 * spinlock, the migrating thread will have filled the lruvec with some
+	 * pages, so we can continue without waiting.
+	 */
+	while (global_enabled ^ READ_ONCE(lruvec->lrugen.enabled)) {
+		/* Not switching this one yet. Wait for a while. */
+		schedule_timeout_uninterruptible(1);
+	}
+
+	return global_enabled;
+}
+
 #ifdef CONFIG_LRU_GEN_ENABLED
 DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
 #define get_cap(cap)	static_branch_likely(&lru_gen_caps[cap])
@@ -5171,6 +5208,8 @@ static void lru_gen_change_state(bool enabled)
 	if (enabled == lru_gen_enabled())
 		goto unlock;
 
+	static_branch_enable_cpuslocked(&lru_gen_draining);
+
 	if (enabled)
 		static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
 	else
@@ -5201,6 +5240,9 @@ static void lru_gen_change_state(bool enabled)
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	static_branch_disable_cpuslocked(&lru_gen_draining);
+
 unlock:
 	mutex_unlock(&state_mutex);
 	put_online_mems();
@@ -5752,6 +5794,16 @@ late_initcall(init_lru_gen);
 
 #else /* !CONFIG_LRU_GEN */
 
+static inline bool lru_gen_is_draining(void)
+{
+	return false;
+}
+
+static inline bool lru_gen_draining_wait(struct lruvec *lruvec, struct scan_control *sc)
+{
+	return false;
+}
+
 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	BUILD_BUG();
@@ -5780,7 +5832,10 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	bool proportional_reclaim;
 	struct blk_plug plug;
 
-	if (lru_gen_enabled() && !root_reclaim(sc)) {
+	if (lru_gen_is_draining() && lru_gen_draining_wait(lruvec, sc)) {
+		lru_gen_shrink_lruvec(lruvec, sc);
+		return;
+	} else if (lru_gen_enabled() && !root_reclaim(sc)) {
 		lru_gen_shrink_lruvec(lruvec, sc);
 		return;
 	}