From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <6ad1fb5d-a859-4611-8af9-aa4d37aeeb38@huaweicloud.com>
Date: Mon, 9 Feb 2026 16:17:10 +0800
Subject: Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim
From: Chen Ridong
To: Yuanchu Xie
Cc: akpm@linux-foundation.org, axelrasmussen@google.com, weixugc@google.com,
 david@kernel.org, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
 vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com,
 corbet@lwn.net, skhan@linuxfoundation.org, hannes@cmpxchg.org,
 roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev,
 zhengqi.arch@bytedance.com, linux-mm@kvack.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, lujialin4@huawei.com,
 ryncsn@gmail.com
References: <20260120134256.2271710-1-chenridong@huaweicloud.com>
 <20260120134256.2271710-2-chenridong@huaweicloud.com>
Content-Type: text/plain; charset=UTF-8

Hi Yuanchu,

On 2026/2/7 6:47, Yuanchu Xie wrote:
> Hi Ridong,
>
> Thanks for working to reconcile the gaps between the LRU implementations.
>
> On Tue, Jan 20, 2026 at 7:57 AM Chen Ridong wrote:
>>
>> From: Chen Ridong
>>
>> The memcg LRU was originally introduced to improve scalability during
>> global reclaim. However, it is complex and only works with gen lru
>> global reclaim.
>> Moreover, its implementation complexity has led to
>> performance regressions when handling a large number of memory cgroups [1].
>>
>> This patch introduces a per-memcg heat level for reclaim, aiming to unify
>> gen lru and traditional LRU global reclaim. The core idea is to track
>> per-node per-memcg reclaim state, including heat, last_decay, and
>> last_refault. The last_refault records the total reclaimed data from the
>> previous memcg reclaim. The last_decay is a time-based parameter; the heat
>> level decays over time if the memcg is not reclaimed again. Both last_decay
>> and last_refault are used to calculate the current heat level when reclaim
>> starts.
>>
>> Three reclaim heat levels are defined: cold, warm, and hot. Cold memcgs are
>> reclaimed first; only if cold memcgs cannot reclaim enough pages, warm
>> memcgs become eligible for reclaim. Hot memcgs are reclaimed last.
>>
>> While this design can be applied to all memcg reclaim scenarios, this patch
>> is conservative and only introduces heat levels for traditional LRU global
>> reclaim. Subsequent patches will replace the memcg LRU with
>> heat-level-based reclaim.
>>
>> Based on tests provided by YU Zhao, traditional LRU global reclaim shows
>> significant performance improvement with heat-level reclaim enabled.
>>
>> The results below are from a 2-hour run of the test [2].
>>
>> Throughput (number of requests)        before      after    Change
>> Total                                 1734169    2353717      +35%
>>
>> Tail latency (number of requests)      before      after    Change
>> [128s, inf)                              1231       1057      -14%
>> [64s, 128s)                               586        444      -24%
>> [32s, 64s)                               1658       1061      -36%
>> [16s, 32s)                               4611       2863      -38%
>
> Do you have any numbers comparing heat-based reclaim to memcg LRU? I
> know Johannes suggested removing memcg LRU, and what you have here
> applies to more reclaim scenarios.
>

Yes, the test data is provided in patch 5/7.
>>
>> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
>> [2] https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
>>
>> Signed-off-by: Chen Ridong
>> ---
>>  include/linux/memcontrol.h |   7 ++
>>  mm/memcontrol.c            |   3 +
>>  mm/vmscan.c                | 227 +++++++++++++++++++++++++++++--------
>>  3 files changed, 192 insertions(+), 45 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index af352cabedba..b293caf70034 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -76,6 +76,12 @@ struct memcg_vmstats;
>>  struct lruvec_stats_percpu;
>>  struct lruvec_stats;
>>
>> +struct memcg_reclaim_state {
>> +	atomic_long_t heat;
>> +	unsigned long last_decay;
>> +	atomic_long_t last_refault;
>> +};
>> +
>>  struct mem_cgroup_reclaim_iter {
>>  	struct mem_cgroup *position;
>>  	/* scan generation, increased every round-trip */
>> @@ -114,6 +120,7 @@ struct mem_cgroup_per_node {
>>  	CACHELINE_PADDING(_pad2_);
>>  	unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>>  	struct mem_cgroup_reclaim_iter iter;
>> +	struct memcg_reclaim_state reclaim;
>>
>>  #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
>>  	/* slab stats for nmi context */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index f2b87e02574e..675d49ad7e2c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3713,6 +3713,9 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>>
>>  	lruvec_init(&pn->lruvec);
>>  	pn->memcg = memcg;
>> +	atomic_long_set(&pn->reclaim.heat, 0);
>> +	pn->reclaim.last_decay = jiffies;
>> +	atomic_long_set(&pn->reclaim.last_refault, 0);
>>
>>  	memcg->nodeinfo[node] = pn;
>>  	return true;
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4aa73f125772..3759cd52c336 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -5978,6 +5978,124 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>>  	return inactive_lru_pages > pages_for_compaction;
>>  }
>>
>> +enum memcg_scan_level {
>> +	MEMCG_LEVEL_COLD,
>> +	MEMCG_LEVEL_WARM,
>> +	MEMCG_LEVEL_HOT,
>> +	MEMCG_LEVEL_MAX,
>> +};
>> +
>> +#define MEMCG_HEAT_WARM		4
>> +#define MEMCG_HEAT_HOT		8
>> +#define MEMCG_HEAT_MAX		12
>> +#define MEMCG_HEAT_DECAY_STEP	1
>> +#define MEMCG_HEAT_DECAY_INTERVAL	(1 * HZ)
> I agree with Kairui; I'm somewhat concerned about this fixed decay
> interval and how it behaves with many memcgs or heavy pressure.
>

Yes, a fixed decay interval may not be optimal for all scenarios. It serves as
a foundational baseline. Perhaps we could expose a BPF hook here for more
flexible tuning.

The referenced benchmark [2] specifically tests under heavy pressure
(continuously triggering global reclaim) and with a large number of memory
cgroups.

>> +
>> +static void memcg_adjust_heat(struct mem_cgroup_per_node *pn, long delta)
>> +{
>> +	long heat, new_heat;
>> +
>> +	if (mem_cgroup_is_root(pn->memcg))
>> +		return;
>> +
>> +	heat = atomic_long_read(&pn->reclaim.heat);
>> +	do {
>> +		new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX);
>> +		if (atomic_long_cmpxchg(&pn->reclaim.heat, heat, new_heat) == heat)
>> +			break;
>> +		heat = atomic_long_read(&pn->reclaim.heat);
>> +	} while (1);
>> +}
>> +
>> +static void memcg_decay_heat(struct mem_cgroup_per_node *pn)
>> +{
>> +	unsigned long last;
>> +	unsigned long now = jiffies;
>> +
>> +	if (mem_cgroup_is_root(pn->memcg))
>> +		return;
>> +
>> +	last = READ_ONCE(pn->reclaim.last_decay);
>> +	if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL))
>> +		return;
>> +
>> +	if (cmpxchg(&pn->reclaim.last_decay, last, now) != last)
>> +		return;
>> +
>> +	memcg_adjust_heat(pn, -MEMCG_HEAT_DECAY_STEP);
>> +}
>> +
>> +static int memcg_heat_level(struct mem_cgroup_per_node *pn)
>> +{
>> +	long heat;
>> +
>> +	if (mem_cgroup_is_root(pn->memcg))
>> +		return MEMCG_LEVEL_COLD;
>> +
>> +	memcg_decay_heat(pn);
> The decay here is somewhat counterintuitive given the name memcg_heat_level.
>

The decay is integrated into the level retrieval. Essentially, whenever
memcg_heat_level is fetched, we check if the decay interval has elapsed
(interval > MEMCG_HEAT_DECAY_INTERVAL). If so, the decay is applied.

>> +	heat = atomic_long_read(&pn->reclaim.heat);
>> +
>> +	if (heat >= MEMCG_HEAT_HOT)
>> +		return MEMCG_LEVEL_HOT;
>> +	if (heat >= MEMCG_HEAT_WARM)
>> +		return MEMCG_LEVEL_WARM;
>> +	return MEMCG_LEVEL_COLD;
>> +}
>> +
>> +static void memcg_record_reclaim_result(struct mem_cgroup_per_node *pn,
>> +					struct lruvec *lruvec,
>> +					unsigned long scanned,
>> +					unsigned long reclaimed)
>> +{
>> +	long delta;
>> +
>> +	if (mem_cgroup_is_root(pn->memcg))
>> +		return;
>> +
>> +	memcg_decay_heat(pn);
> Could you combine the decay and adjust later in this function?
>

Sure.

>> +
>> +	/*
>> +	 * Memory cgroup heat adjustment algorithm:
>> +	 * - If scanned == 0: mark as hottest (+MAX_HEAT)
>> +	 * - If reclaimed >= 50% * scanned: strong cool (-2)
>> +	 * - If reclaimed >= 25% * scanned: mild cool (-1)
>> +	 * - Otherwise: warm up (+1)
>> +	 */
>> +	if (!scanned)
>> +		delta = MEMCG_HEAT_MAX;
>> +	else if (reclaimed * 2 >= scanned)
>> +		delta = -2;
>> +	else if (reclaimed * 4 >= scanned)
>> +		delta = -1;
>> +	else
>> +		delta = 1;
>> +
>> +	/*
>> +	 * Refault-based heat adjustment:
>> +	 * - If refault increase > reclaimed pages: heat up (more cautious reclaim)
>> +	 * - If no refaults and currently warm: cool down (allow more reclaim)
>> +	 * This prevents thrashing by backing off when refaults indicate over-reclaim.
>> +	 */
>> +	if (lruvec) {
>> +		unsigned long total_refaults;
>> +		unsigned long prev;
>> +		long refault_delta;
>> +
>> +		total_refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_ANON);
>> +		total_refaults += lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_FILE);
>> +
>> +		prev = atomic_long_xchg(&pn->reclaim.last_refault, total_refaults);
>> +		refault_delta = total_refaults - prev;
>> +
>> +		if (refault_delta > reclaimed)
>> +			delta++;
>> +		else if (!refault_delta && delta > 0)
>> +			delta--;
>> +	}
>
> I think this metric is based more on the memcg's reclaimability than
> on heat. Though the memcgs are grouped based on absolute metrics and
> not relative to others.
>

I might be misunderstanding your comment. Could you elaborate?

As designed, the heat level is indeed derived from the memcg's own
reclaimability (reclaimed/scanned) and refault behavior. In essence, it
quantifies the difficulty or “heat” of reclaiming memory from that specific
cgroup. This metric directly correlates to whether a memcg can release memory
easily or not.

>> +
>> +	memcg_adjust_heat(pn, delta);
>> +}
>> +
>>  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>>  {
>> ...snip
>>  }
>
> Thanks,
> Yuanchu

-- 
Best regards,
Ridong