From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 21 Jan 2026 22:58:27 +0800
From: Kairui Song <ryncsn@gmail.com>
To: Chen Ridong
Cc: akpm@linux-foundation.org, axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com, david@kernel.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, corbet@lwn.net,
	skhan@linuxfoundation.org, hannes@cmpxchg.org,
	roman.gushchin@linux.dev, shakeel.butt@linux.dev,
	muchun.song@linux.dev, zhengqi.arch@bytedance.com,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	lujialin4@huawei.com
Subject: Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim
References: <20260120134256.2271710-1-chenridong@huaweicloud.com>
	<20260120134256.2271710-2-chenridong@huaweicloud.com>
In-Reply-To: <20260120134256.2271710-2-chenridong@huaweicloud.com>

On Tue, Jan 20, 2026 at 01:42:50PM +0800, Chen Ridong wrote:
> From: Chen Ridong
>
> The memcg LRU was originally introduced to improve scalability during
> global reclaim. However, it is complex and only works with gen lru
> global reclaim. Moreover, its implementation complexity has led to
> performance regressions when handling a large number of memory
> cgroups [1].
>
> This patch introduces a per-memcg heat level for reclaim, aiming to
> unify gen lru and traditional LRU global reclaim. The core idea is to
> track per-node per-memcg reclaim state, including heat, last_decay,
> and last_refault. The last_refault records the total reclaimed data
> from the previous memcg reclaim. The last_decay is a time-based
> parameter; the heat level decays over time if the memcg is not
> reclaimed again. Both last_decay and last_refault are used to
> calculate the current heat level when reclaim starts.
>
> Three reclaim heat levels are defined: cold, warm, and hot. Cold
> memcgs are reclaimed first; only if cold memcgs cannot reclaim enough
> pages do warm memcgs become eligible for reclaim. Hot memcgs are
> reclaimed last.
>
> While this design can be applied to all memcg reclaim scenarios, this
> patch is conservative and only introduces heat levels for traditional
> LRU global reclaim. Subsequent patches will replace the memcg LRU
> with heat-level-based reclaim.
>
> Based on tests provided by Yu Zhao, traditional LRU global reclaim
> shows significant performance improvement with heat-level reclaim
> enabled.
>
> The results below are from a 2-hour run of the test [2].
>
> Throughput (number of requests)      before     after      Change
> Total                                1734169    2353717    +35%
>
> Tail latency (number of requests)    before     after      Change
> [128s, inf)                          1231       1057       -14%
> [64s, 128s)                          586        444        -24%
> [32s, 64s)                           1658       1061       -36%
> [16s, 32s)                           4611       2863       -38%
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/

Hi Ridong,

Thanks very much for checking the test! The benchmark looks good. I
don't have a strong opinion on the whole approach yet, as I'm still
checking the whole series, but I have some comments and questions for
this patch:

>
> Signed-off-by: Chen Ridong
> ---
>  include/linux/memcontrol.h |   7 ++
>  mm/memcontrol.c            |   3 +
>  mm/vmscan.c                | 227 +++++++++++++++++++++++++++++--------
>  3 files changed, 192 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index af352cabedba..b293caf70034 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,6 +76,12 @@ struct memcg_vmstats;
>  struct lruvec_stats_percpu;
>  struct lruvec_stats;
>
> +struct memcg_reclaim_state {
> +	atomic_long_t heat;
> +	unsigned long last_decay;
> +	atomic_long_t last_refault;
> +};
> +
>  struct mem_cgroup_reclaim_iter {
>  	struct mem_cgroup *position;
>  	/* scan generation, increased every round-trip */
> @@ -114,6 +120,7 @@ struct mem_cgroup_per_node {
>  	CACHELINE_PADDING(_pad2_);
>  	unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>  	struct mem_cgroup_reclaim_iter iter;
> +	struct memcg_reclaim_state reclaim;
>
>  #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
>  	/* slab stats for nmi context */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f2b87e02574e..675d49ad7e2c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3713,6 +3713,9 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>
>  	lruvec_init(&pn->lruvec);
>  	pn->memcg = memcg;
> +	atomic_long_set(&pn->reclaim.heat, 0);
> +	pn->reclaim.last_decay = jiffies;
> +	atomic_long_set(&pn->reclaim.last_refault, 0);
>
>  	memcg->nodeinfo[node] = pn;
>  	return true;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4aa73f125772..3759cd52c336 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5978,6 +5978,124 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  	return inactive_lru_pages > pages_for_compaction;
>  }
>
> +enum memcg_scan_level {
> +	MEMCG_LEVEL_COLD,
> +	MEMCG_LEVEL_WARM,
> +	MEMCG_LEVEL_HOT,
> +	MEMCG_LEVEL_MAX,
> +};

This looks similar to MEMCG_LRU_HEAD, MEMCG_LRU_TAIL, MEMCG_LRU_OLD and
MEMCG_LRU_YOUNG of the memcg LRU? But now it's unaware of the aging
event?

> +
> +#define MEMCG_HEAT_WARM			4
> +#define MEMCG_HEAT_HOT			8
> +#define MEMCG_HEAT_MAX			12
> +#define MEMCG_HEAT_DECAY_STEP		1
> +#define MEMCG_HEAT_DECAY_INTERVAL	(1 * HZ)

This is a hardcoded interval (1s), but memcg_decay_heat is driven by
reclaim, which is kind of random: it could run very frequently or not
at all. That doesn't look right at first glance.

> +
> +static void memcg_adjust_heat(struct mem_cgroup_per_node *pn, long delta)
> +{
> +	long heat, new_heat;
> +
> +	if (mem_cgroup_is_root(pn->memcg))
> +		return;
> +
> +	heat = atomic_long_read(&pn->reclaim.heat);
> +	do {
> +		new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX);

The hotness range is 0 - 12; is that a suitable value for all setups
and workloads?

> +		if (atomic_long_cmpxchg(&pn->reclaim.heat, heat, new_heat) == heat)
> +			break;
> +		heat = atomic_long_read(&pn->reclaim.heat);
> +	} while (1);
> +}
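A nit: the open-coded retry loop could use atomic_long_try_cmpxchg,
which re-reads the current value into `heat` on failure. Something
like this (untested, but should be equivalent):

	heat = atomic_long_read(&pn->reclaim.heat);
	do {
		new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX);
	} while (!atomic_long_try_cmpxchg(&pn->reclaim.heat, &heat, new_heat));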
> +
> +static void memcg_decay_heat(struct mem_cgroup_per_node *pn)
> +{
> +	unsigned long last;
> +	unsigned long now = jiffies;
> +
> +	if (mem_cgroup_is_root(pn->memcg))
> +		return;
> +
> +	last = READ_ONCE(pn->reclaim.last_decay);
> +	if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL))
> +		return;
> +
> +	if (cmpxchg(&pn->reclaim.last_decay, last, now) != last)
> +		return;
> +
> +	memcg_adjust_heat(pn, -MEMCG_HEAT_DECAY_STEP);
> +}
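On the interval point above: this always decays by a single step no
matter how much time has passed, so a memcg that saw no reclaim for a
minute cools down just as much as one reclaimed a second ago. If you
keep the reclaim-driven design, maybe scale the decay by the number of
elapsed intervals? A rough, untested sketch:

	unsigned long intervals = (now - last) / MEMCG_HEAT_DECAY_INTERVAL;

	if (intervals && cmpxchg(&pn->reclaim.last_decay, last, now) == last)
		memcg_adjust_heat(pn, -(long)intervals * MEMCG_HEAT_DECAY_STEP);

That way the heat converges to the same value regardless of how often
reclaim happens to call this.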
> +
> +static int memcg_heat_level(struct mem_cgroup_per_node *pn)
> +{
> +	long heat;
> +
> +	if (mem_cgroup_is_root(pn->memcg))
> +		return MEMCG_LEVEL_COLD;
> +
> +	memcg_decay_heat(pn);
> +	heat = atomic_long_read(&pn->reclaim.heat);
> +
> +	if (heat >= MEMCG_HEAT_HOT)
> +		return MEMCG_LEVEL_HOT;
> +	if (heat >= MEMCG_HEAT_WARM)
> +		return MEMCG_LEVEL_WARM;
> +	return MEMCG_LEVEL_COLD;
> +}
> +
> +static void memcg_record_reclaim_result(struct mem_cgroup_per_node *pn,
> +					struct lruvec *lruvec,
> +					unsigned long scanned,
> +					unsigned long reclaimed)
> +{
> +	long delta;
> +
> +	if (mem_cgroup_is_root(pn->memcg))
> +		return;
> +
> +	memcg_decay_heat(pn);
> +
> +	/*
> +	 * Memory cgroup heat adjustment algorithm:
> +	 * - If scanned == 0: mark as hottest (+MAX_HEAT)
> +	 * - If reclaimed >= 50% * scanned: strong cool (-2)
> +	 * - If reclaimed >= 25% * scanned: mild cool (-1)
> +	 * - Otherwise: warm up (+1)

The naming is a bit confusing I think: no scan doesn't mean it's all
hot. Maybe you mean no reclaim? No scan could also mean an empty memcg?

> +	 */
> +	if (!scanned)
> +		delta = MEMCG_HEAT_MAX;
> +	else if (reclaimed * 2 >= scanned)
> +		delta = -2;
> +	else if (reclaimed * 4 >= scanned)
> +		delta = -1;
> +	else
> +		delta = 1;
> +
> +	/*
> +	 * Refault-based heat adjustment:
> +	 * - If refault increase > reclaimed pages: heat up (more cautious reclaim)
> +	 * - If no refaults and currently warm: cool down (allow more reclaim)
> +	 * This prevents thrashing by backing off when refaults indicate over-reclaim.
> +	 */
> +	if (lruvec) {
> +		unsigned long total_refaults;
> +		unsigned long prev;
> +		long refault_delta;
> +
> +		total_refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_ANON);
> +		total_refaults += lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_FILE);

I think you want WORKINGSET_REFAULT_* or WORKINGSET_RESTORE_* here.
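i.e., if the intent is really to track refaults, something like:

	total_refaults = lruvec_page_state(lruvec, WORKINGSET_REFAULT_ANON);
	total_refaults += lruvec_page_state(lruvec, WORKINGSET_REFAULT_FILE);

WORKINGSET_ACTIVATE_* only counts the refaulted pages that were also
hot enough to get re-activated, so it would under-report thrashing
here.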
> +
> +		prev = atomic_long_xchg(&pn->reclaim.last_refault, total_refaults);
> +		refault_delta = total_refaults - prev;
> +
> +		if (refault_delta > reclaimed)
> +			delta++;
> +		else if (!refault_delta && delta > 0)
> +			delta--;
> +	}
> +
> +	memcg_adjust_heat(pn, delta);
> +}
> +
>  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  {
>  	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
> @@ -5986,7 +6104,8 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  	};
>  	struct mem_cgroup_reclaim_cookie *partial = &reclaim;
>  	struct mem_cgroup *memcg;
> -
> +	int level;
> +	int max_level = root_reclaim(sc) ? MEMCG_LEVEL_MAX : MEMCG_LEVEL_WARM;

Why limit to MEMCG_LEVEL_WARM when it's not a root reclaim?

>  	/*
>  	 * In most cases, direct reclaimers can do partial walks
>  	 * through the cgroup tree, using an iterator state that
> @@ -5999,62 +6118,80 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  	if (current_is_kswapd() || sc->memcg_full_walk)
>  		partial = NULL;
>
> -	memcg = mem_cgroup_iter(target_memcg, NULL, partial);
> -	do {
> -		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -		unsigned long reclaimed;
> -		unsigned long scanned;
> -
> -		/*
> -		 * This loop can become CPU-bound when target memcgs
> -		 * aren't eligible for reclaim - either because they
> -		 * don't have any reclaimable pages, or because their
> -		 * memory is explicitly protected. Avoid soft lockups.
> -		 */
> -		cond_resched();
> +	for (level = MEMCG_LEVEL_COLD; level < max_level; level++) {
> +		bool need_next_level = false;
>
> -		mem_cgroup_calculate_protection(target_memcg, memcg);
> +		memcg = mem_cgroup_iter(target_memcg, NULL, partial);
> +		do {
> +			struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +			unsigned long reclaimed;
> +			unsigned long scanned;
> +			struct mem_cgroup_per_node *pn = memcg->nodeinfo[pgdat->node_id];
>
> -		if (mem_cgroup_below_min(target_memcg, memcg)) {
> -			/*
> -			 * Hard protection.
> -			 * If there is no reclaimable memory, OOM.
> -			 */
> -			continue;
> -		} else if (mem_cgroup_below_low(target_memcg, memcg)) {
>  			/*
> -			 * Soft protection.
> -			 * Respect the protection only as long as
> -			 * there is an unprotected supply
> -			 * of reclaimable memory from other cgroups.
> +			 * This loop can become CPU-bound when target memcgs
> +			 * aren't eligible for reclaim - either because they
> +			 * don't have any reclaimable pages, or because their
> +			 * memory is explicitly protected. Avoid soft lockups.
>  			 */
> -			if (!sc->memcg_low_reclaim) {
> -				sc->memcg_low_skipped = 1;
> +			cond_resched();
> +
> +			mem_cgroup_calculate_protection(target_memcg, memcg);
> +
> +			if (mem_cgroup_below_min(target_memcg, memcg)) {
> +				/*
> +				 * Hard protection.
> +				 * If there is no reclaimable memory, OOM.
> +				 */
>  				continue;
> +			} else if (mem_cgroup_below_low(target_memcg, memcg)) {
> +				/*
> +				 * Soft protection.
> +				 * Respect the protection only as long as
> +				 * there is an unprotected supply
> +				 * of reclaimable memory from other cgroups.
> +				 */
> +				if (!sc->memcg_low_reclaim) {
> +					sc->memcg_low_skipped = 1;
> +					continue;
> +				}
> +				memcg_memory_event(memcg, MEMCG_LOW);
>  			}
> -			memcg_memory_event(memcg, MEMCG_LOW);
> -		}
>
> -		reclaimed = sc->nr_reclaimed;
> -		scanned = sc->nr_scanned;
> +			if (root_reclaim(sc) && memcg_heat_level(pn) > level) {
> +				need_next_level = true;
> +				continue;
> +			}
>
> -		shrink_lruvec(lruvec, sc);
> +			reclaimed = sc->nr_reclaimed;
> +			scanned = sc->nr_scanned;
>
> -		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> -			    sc->priority);
> +			shrink_lruvec(lruvec, sc);
> +			if (!memcg || memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B))

If we might have memcg == NULL here, the
pn = memcg->nodeinfo[pgdat->node_id] and other memcg operations above
look kind of dangerous. Also, why check NR_SLAB_RECLAIMABLE_B if there
wasn't such a check previously? Maybe worth a separate patch.

> +				shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
> +					    sc->priority);
>
> -		/* Record the group's reclaim efficiency */
> -		if (!sc->proactive)
> -			vmpressure(sc->gfp_mask, memcg, false,
> -				   sc->nr_scanned - scanned,
> -				   sc->nr_reclaimed - reclaimed);
> +			if (root_reclaim(sc))
> +				memcg_record_reclaim_result(pn, lruvec,
> +							    sc->nr_scanned - scanned,
> +							    sc->nr_reclaimed - reclaimed);

Why only record the reclaim result for root_reclaim?

> -		/* If partial walks are allowed, bail once goal is reached */
> -		if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) {
> -			mem_cgroup_iter_break(target_memcg, memcg);
> +			/* Record the group's reclaim efficiency */
> +			if (!sc->proactive)
> +				vmpressure(sc->gfp_mask, memcg, false,
> +					   sc->nr_scanned - scanned,
> +					   sc->nr_reclaimed - reclaimed);
> +
> +			/* If partial walks are allowed, bail once goal is reached */
> +			if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) {
> +				mem_cgroup_iter_break(target_memcg, memcg);
> +				break;
> +			}
> +		} while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial)));
> +
> +		if (!need_next_level)
>  			break;
> -		}
> -	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial)));
> +	}

IIUC you are iterating over all the memcgs up to MEMCG_LEVEL_MAX times
and only reclaiming certain memcgs in each pass. I think in theory some
workloads may see higher overhead since there are actually more
iterations, and will this break the reclaim fairness?

>  }
>
>  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> --
> 2.34.1