From: Yuanchu Xie <yuanchu@google.com>
To: Chen Ridong <chenridong@huaweicloud.com>
Cc: akpm@linux-foundation.org, axelrasmussen@google.com,
weixugc@google.com, david@kernel.org,
lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, corbet@lwn.net, skhan@linuxfoundation.org,
hannes@cmpxchg.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev,
zhengqi.arch@bytedance.com, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
cgroups@vger.kernel.org, lujialin4@huawei.com, ryncsn@gmail.com
Subject: Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim
Date: Fri, 6 Feb 2026 16:47:31 -0600
Message-ID: <CAJj2-QEvrgQ+R-nc3LZ-cBfnzjakxfSgmNbqDa-RFBVOpdVaAQ@mail.gmail.com>
In-Reply-To: <20260120134256.2271710-2-chenridong@huaweicloud.com>

Hi Ridong,

Thanks for working to reconcile the gaps between the LRU implementations.

On Tue, Jan 20, 2026 at 7:57 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced to improve scalability during
> global reclaim. However, it is complex and only works with gen LRU
> global reclaim. Moreover, its implementation complexity has led to
> performance regressions when handling a large number of memory cgroups [1].
>
> This patch introduces a per-memcg heat level for reclaim, aiming to unify
> gen LRU and traditional LRU global reclaim. The core idea is to track
> per-node per-memcg reclaim state, consisting of heat, last_decay, and
> last_refault. The last_refault field records the lruvec's total workingset
> refaults as of the previous memcg reclaim. The last_decay field is a
> timestamp: the heat level decays over time if the memcg is not reclaimed
> again. Both last_decay and last_refault are used to calculate the current
> heat level when reclaim starts.
>
> Three reclaim heat levels are defined: cold, warm, and hot. Cold memcgs are
> reclaimed first; only if reclaiming the cold memcgs does not free enough
> pages do warm memcgs become eligible for reclaim. Hot memcgs are reclaimed
> last.
>
> While this design can be applied to all memcg reclaim scenarios, this patch
> is conservative and only introduces heat levels for traditional LRU global
> reclaim. Subsequent patches will replace the memcg LRU with
> heat-level-based reclaim.
>
> Based on tests provided by Yu Zhao, traditional LRU global reclaim shows a
> significant performance improvement with heat-level reclaim enabled.
>
> The results below are from a 2-hour run of the test [2].
>
> Throughput (number of requests)     before      after    Change
> Total                              1734169    2353717      +35%
>
> Tail latency (number of requests)   before      after    Change
> [128s, inf)                           1231       1057      -14%
> [64s, 128s)                            586        444      -24%
> [32s, 64s)                            1658       1061      -36%
> [16s, 32s)                            4611       2863      -38%
Do you have any numbers comparing heat-based reclaim to memcg LRU? I
know Johannes suggested removing memcg LRU, and what you have here
applies to more reclaim scenarios.
>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
> include/linux/memcontrol.h | 7 ++
> mm/memcontrol.c | 3 +
> mm/vmscan.c | 227 +++++++++++++++++++++++++++++--------
> 3 files changed, 192 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index af352cabedba..b293caf70034 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,6 +76,12 @@ struct memcg_vmstats;
> struct lruvec_stats_percpu;
> struct lruvec_stats;
>
> +struct memcg_reclaim_state {
> + atomic_long_t heat;
> + unsigned long last_decay;
> + atomic_long_t last_refault;
> +};
> +
> struct mem_cgroup_reclaim_iter {
> struct mem_cgroup *position;
> /* scan generation, increased every round-trip */
> @@ -114,6 +120,7 @@ struct mem_cgroup_per_node {
> CACHELINE_PADDING(_pad2_);
> unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
> struct mem_cgroup_reclaim_iter iter;
> + struct memcg_reclaim_state reclaim;
>
> #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
> /* slab stats for nmi context */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f2b87e02574e..675d49ad7e2c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3713,6 +3713,9 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>
> lruvec_init(&pn->lruvec);
> pn->memcg = memcg;
> + atomic_long_set(&pn->reclaim.heat, 0);
> + pn->reclaim.last_decay = jiffies;
> + atomic_long_set(&pn->reclaim.last_refault, 0);
>
> memcg->nodeinfo[node] = pn;
> return true;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4aa73f125772..3759cd52c336 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5978,6 +5978,124 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> return inactive_lru_pages > pages_for_compaction;
> }
>
> +enum memcg_scan_level {
> + MEMCG_LEVEL_COLD,
> + MEMCG_LEVEL_WARM,
> + MEMCG_LEVEL_HOT,
> + MEMCG_LEVEL_MAX,
> +};
> +
> +#define MEMCG_HEAT_WARM 4
> +#define MEMCG_HEAT_HOT 8
> +#define MEMCG_HEAT_MAX 12
> +#define MEMCG_HEAT_DECAY_STEP 1
> +#define MEMCG_HEAT_DECAY_INTERVAL (1 * HZ)
I agree with Kairui; I'm somewhat concerned about the fixed one-second decay
interval and how it behaves with many memcgs or under heavy reclaim pressure,
since each decay pass only sheds a single step no matter how long the memcg
has sat idle.
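Not a blocker, but one option might be to decay by the number of elapsed
intervals instead of a single step per call, so a long-idle memcg cools off
no matter how often reclaim happens to visit it. Roughly (untested sketch,
reusing your helpers and constants):

static void memcg_decay_heat(struct mem_cgroup_per_node *pn)
{
	unsigned long last, now = jiffies;
	long steps;

	if (mem_cgroup_is_root(pn->memcg))
		return;

	last = READ_ONCE(pn->reclaim.last_decay);
	if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL))
		return;

	/* one step per elapsed interval, not one per call */
	steps = (long)(now - last) / MEMCG_HEAT_DECAY_INTERVAL;

	/* only the winner of the race applies the decay */
	if (cmpxchg(&pn->reclaim.last_decay, last, now) != last)
		return;

	memcg_adjust_heat(pn, -steps * MEMCG_HEAT_DECAY_STEP);
}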
> +
> +static void memcg_adjust_heat(struct mem_cgroup_per_node *pn, long delta)
> +{
> + long heat, new_heat;
> +
> + if (mem_cgroup_is_root(pn->memcg))
> + return;
> +
> + heat = atomic_long_read(&pn->reclaim.heat);
> + do {
> + new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX);
> + if (atomic_long_cmpxchg(&pn->reclaim.heat, heat, new_heat) == heat)
> + break;
> + heat = atomic_long_read(&pn->reclaim.heat);
> + } while (1);
> +}
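Nit: atomic_long_try_cmpxchg() would make this loop a bit tighter, since it
refreshes the expected value on failure. Something like (sketch):

	long old = atomic_long_read(&pn->reclaim.heat);
	long new;

	do {
		new = clamp_t(long, old + delta, 0, MEMCG_HEAT_MAX);
	} while (!atomic_long_try_cmpxchg(&pn->reclaim.heat, &old, new));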
> +
> +static void memcg_decay_heat(struct mem_cgroup_per_node *pn)
> +{
> + unsigned long last;
> + unsigned long now = jiffies;
> +
> + if (mem_cgroup_is_root(pn->memcg))
> + return;
> +
> + last = READ_ONCE(pn->reclaim.last_decay);
> + if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL))
> + return;
> +
> + if (cmpxchg(&pn->reclaim.last_decay, last, now) != last)
> + return;
> +
> + memcg_adjust_heat(pn, -MEMCG_HEAT_DECAY_STEP);
> +}
> +
> +static int memcg_heat_level(struct mem_cgroup_per_node *pn)
> +{
> + long heat;
> +
> + if (mem_cgroup_is_root(pn->memcg))
> + return MEMCG_LEVEL_COLD;
> +
> + memcg_decay_heat(pn);
The decay here is somewhat counterintuitive given the name memcg_heat_level:
a function that reads like a pure query also mutates the heat state.
> + heat = atomic_long_read(&pn->reclaim.heat);
> +
> + if (heat >= MEMCG_HEAT_HOT)
> + return MEMCG_LEVEL_HOT;
> + if (heat >= MEMCG_HEAT_WARM)
> + return MEMCG_LEVEL_WARM;
> + return MEMCG_LEVEL_COLD;
> +}
> +
> +static void memcg_record_reclaim_result(struct mem_cgroup_per_node *pn,
> + struct lruvec *lruvec,
> + unsigned long scanned,
> + unsigned long reclaimed)
> +{
> + long delta;
> +
> + if (mem_cgroup_is_root(pn->memcg))
> + return;
> +
> + memcg_decay_heat(pn);
Could the decay be folded into the memcg_adjust_heat() call at the end of
this function, so the heat word is only updated once? (Sketch after the end
of the function below.)
> +
> + /*
> + * Memory cgroup heat adjustment algorithm:
> + * - If scanned == 0: mark as hottest (+MAX_HEAT)
> + * - If reclaimed >= 50% * scanned: strong cool (-2)
> + * - If reclaimed >= 25% * scanned: mild cool (-1)
> + * - Otherwise: warm up (+1)
> + */
> + if (!scanned)
> + delta = MEMCG_HEAT_MAX;
> + else if (reclaimed * 2 >= scanned)
> + delta = -2;
> + else if (reclaimed * 4 >= scanned)
> + delta = -1;
> + else
> + delta = 1;
> +
> + /*
> + * Refault-based heat adjustment:
> + * - If refault increase > reclaimed pages: heat up (more cautious reclaim)
> + * - If no refaults and currently warm: cool down (allow more reclaim)
> + * This prevents thrashing by backing off when refaults indicate over-reclaim.
> + */
> + if (lruvec) {
> + unsigned long total_refaults;
> + unsigned long prev;
> + long refault_delta;
> +
> + total_refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_ANON);
> + total_refaults += lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_FILE);
> +
> + prev = atomic_long_xchg(&pn->reclaim.last_refault, total_refaults);
> + refault_delta = total_refaults - prev;
> +
> + if (refault_delta > reclaimed)
> + delta++;
> + else if (!refault_delta && delta > 0)
> + delta--;
> + }
I think this metric reflects the memcg's reclaimability more than its heat.
Also, memcgs are bucketed by absolute thresholds rather than relative to one
another, so the grouping doesn't adapt to how the heat values are actually
distributed across memcgs.
> +
> + memcg_adjust_heat(pn, delta);
> +}
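To make the comment above about combining the decay with the adjustment
concrete: I was picturing a single delta applied once at the end of the
function, along these lines (sketch only; memcg_decay_steps() is a
hypothetical helper that returns how many decay steps are owed and advances
last_decay):

	/* start from the decay owed since the last visit (<= 0) */
	long delta = -memcg_decay_steps(pn) * MEMCG_HEAT_DECAY_STEP;

	if (!scanned)
		delta += MEMCG_HEAT_MAX;
	else if (reclaimed * 2 >= scanned)
		delta -= 2;
	else if (reclaimed * 4 >= scanned)
		delta -= 1;
	else
		delta += 1;

	/* refault-based correction as in your patch ... */

	memcg_adjust_heat(pn, delta);

That way the heat word is only touched once per reclaim pass.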
> +
> static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> {
> ...snip
> }
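Also, since shrink_node_memcgs() is snipped above, let me check my
understanding of the level-ordered walk. I imagine it ends up looking roughly
like this (a sketch based on the cover letter, not your actual code; target
stands for sc->target_mem_cgroup):

	int level;
	struct mem_cgroup *target = sc->target_mem_cgroup;

	for (level = MEMCG_LEVEL_COLD; level < MEMCG_LEVEL_MAX; level++) {
		struct mem_cgroup *memcg = NULL;

		while ((memcg = mem_cgroup_iter(target, memcg, NULL))) {
			struct mem_cgroup_per_node *pn =
				memcg->nodeinfo[pgdat->node_id];

			/* defer memcgs hotter than the current pass */
			if (memcg_heat_level(pn) > level)
				continue;

			shrink_lruvec(mem_cgroup_lruvec(memcg, pgdat), sc);

			if (sc->nr_reclaimed >= sc->nr_to_reclaim) {
				mem_cgroup_iter_break(target, memcg);
				return;
			}
		}
	}

If that's roughly the shape of it, the worst case is one extra full memcg
walk per level, which seems fine; it would be good to spell that out in the
changelog.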
Thanks,
Yuanchu