Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Yuanchu Xie <yuanchu@google.com>
To: Chen Ridong <chenridong@huaweicloud.com>
Cc: akpm@linux-foundation.org, axelrasmussen@google.com,
	weixugc@google.com,  david@kernel.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	 vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com,  corbet@lwn.net, skhan@linuxfoundation.org,
	hannes@cmpxchg.org,  roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, muchun.song@linux.dev,
	 zhengqi.arch@bytedance.com, linux-mm@kvack.org,
	linux-doc@vger.kernel.org,  linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, lujialin4@huawei.com,  ryncsn@gmail.com
Subject: Re: [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim
Date: Fri, 6 Feb 2026 16:47:31 -0600	[thread overview]
Message-ID: <CAJj2-QEvrgQ+R-nc3LZ-cBfnzjakxfSgmNbqDa-RFBVOpdVaAQ@mail.gmail.com> (raw)
In-Reply-To: <20260120134256.2271710-2-chenridong@huaweicloud.com>

Hi Ridong,

Thanks for working to reconcile the gaps between the LRU implementations.

On Tue, Jan 20, 2026 at 7:57 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
> From: Chen Ridong <chenridong@huawei.com>
>
> The memcg LRU was originally introduced to improve scalability during
> global reclaim. However, it is complex and only works with gen lru
> global reclaim. Moreover, its implementation complexity has led to
> performance regressions when handling a large number of memory cgroups [1].
>
> This patch introduces a per-memcg heat level for reclaim, aiming to unify
> gen lru and traditional LRU global reclaim. The core idea is to track
> per-node per-memcg reclaim state, including heat, last_decay, and
> last_refault. The last_refault records the total reclaimed data from the
> previous memcg reclaim. The last_decay is a time-based parameter; the heat
> level decays over time if the memcg is not reclaimed again. Both last_decay
> and last_refault are used to calculate the current heat level when reclaim
> starts.
>
> Three reclaim heat levels are defined: cold, warm, and hot. Cold memcgs are
> reclaimed first; only if cold memcgs cannot reclaim enough pages, warm
> memcgs become eligible for reclaim. Hot memcgs are reclaimed last.
>
> While this design can be applied to all memcg reclaim scenarios, this patch
> is conservative and only introduces heat levels for traditional LRU global
> reclaim. Subsequent patches will replace the memcg LRU with
> heat-level-based reclaim.
>
> Based on tests provided by YU Zhao, traditional LRU global reclaim shows
> significant performance improvement with heat-level reclaim enabled.
>
> The results below are from a 2-hour run of the test [2].
>
> Throughput (number of requests)         before     after        Change
> Total                                   1734169    2353717      +35%
>
> Tail latency (number of requests)       before     after        Change
> [128s, inf)                             1231       1057         -14%
> [64s, 128s)                             586        444          -24%
> [32s, 64s)                              1658       1061         -36%
> [16s, 32s)                              4611       2863         -38%

Do you have any numbers comparing heat-based reclaim to memcg LRU?  I
know Johannes suggested removing memcg LRU, and what you have here
applies to more reclaim scenarios.

>
> [1] https://lore.kernel.org/r/20251126171513.GC135004@cmpxchg.org
> [2] https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
>
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>  include/linux/memcontrol.h |   7 ++
>  mm/memcontrol.c            |   3 +
>  mm/vmscan.c                | 227 +++++++++++++++++++++++++++++--------
>  3 files changed, 192 insertions(+), 45 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index af352cabedba..b293caf70034 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -76,6 +76,12 @@ struct memcg_vmstats;
>  struct lruvec_stats_percpu;
>  struct lruvec_stats;
>
> +struct memcg_reclaim_state {
> +       atomic_long_t heat;
> +       unsigned long last_decay;
> +       atomic_long_t last_refault;
> +};
> +
>  struct mem_cgroup_reclaim_iter {
>         struct mem_cgroup *position;
>         /* scan generation, increased every round-trip */
> @@ -114,6 +120,7 @@ struct mem_cgroup_per_node {
>         CACHELINE_PADDING(_pad2_);
>         unsigned long           lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
>         struct mem_cgroup_reclaim_iter  iter;
> +       struct memcg_reclaim_state      reclaim;
>
>  #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
>         /* slab stats for nmi context */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f2b87e02574e..675d49ad7e2c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3713,6 +3713,9 @@ static bool alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>
>         lruvec_init(&pn->lruvec);
>         pn->memcg = memcg;
> +       atomic_long_set(&pn->reclaim.heat, 0);
> +       pn->reclaim.last_decay = jiffies;
> +       atomic_long_set(&pn->reclaim.last_refault, 0);
>
>         memcg->nodeinfo[node] = pn;
>         return true;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4aa73f125772..3759cd52c336 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5978,6 +5978,124 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>         return inactive_lru_pages > pages_for_compaction;
>  }
>
> +enum memcg_scan_level {
> +       MEMCG_LEVEL_COLD,
> +       MEMCG_LEVEL_WARM,
> +       MEMCG_LEVEL_HOT,
> +       MEMCG_LEVEL_MAX,
> +};
> +
> +#define MEMCG_HEAT_WARM                4
> +#define MEMCG_HEAT_HOT         8
> +#define MEMCG_HEAT_MAX         12
> +#define MEMCG_HEAT_DECAY_STEP  1
> +#define MEMCG_HEAT_DECAY_INTERVAL      (1 * HZ)
I agree with Kairui; I'm somewhat concerned about this fixed decay
interval and how it behaves with many memcgs or heavy pressure.

> +
> +static void memcg_adjust_heat(struct mem_cgroup_per_node *pn, long delta)
> +{
> +       long heat, new_heat;
> +
> +       if (mem_cgroup_is_root(pn->memcg))
> +               return;
> +
> +       heat = atomic_long_read(&pn->reclaim.heat);
> +       do {
> +               new_heat = clamp_t(long, heat + delta, 0, MEMCG_HEAT_MAX);
> +               if (atomic_long_cmpxchg(&pn->reclaim.heat, heat, new_heat) == heat)
> +                       break;
> +               heat = atomic_long_read(&pn->reclaim.heat);
> +       } while (1);
> +}
> +
> +static void memcg_decay_heat(struct mem_cgroup_per_node *pn)
> +{
> +       unsigned long last;
> +       unsigned long now = jiffies;
> +
> +       if (mem_cgroup_is_root(pn->memcg))
> +               return;
> +
> +       last = READ_ONCE(pn->reclaim.last_decay);
> +       if (!time_after(now, last + MEMCG_HEAT_DECAY_INTERVAL))
> +               return;
> +
> +       if (cmpxchg(&pn->reclaim.last_decay, last, now) != last)
> +               return;
> +
> +       memcg_adjust_heat(pn, -MEMCG_HEAT_DECAY_STEP);
> +}
> +
> +static int memcg_heat_level(struct mem_cgroup_per_node *pn)
> +{
> +       long heat;
> +
> +       if (mem_cgroup_is_root(pn->memcg))
> +               return MEMCG_LEVEL_COLD;
> +
> +       memcg_decay_heat(pn);
The decay here is somewhat counterintuitive given the name memcg_heat_level.

> +       heat = atomic_long_read(&pn->reclaim.heat);
> +
> +       if (heat >= MEMCG_HEAT_HOT)
> +               return MEMCG_LEVEL_HOT;
> +       if (heat >= MEMCG_HEAT_WARM)
> +               return MEMCG_LEVEL_WARM;
> +       return MEMCG_LEVEL_COLD;
> +}
> +
> +static void memcg_record_reclaim_result(struct mem_cgroup_per_node *pn,
> +                                       struct lruvec *lruvec,
> +                                       unsigned long scanned,
> +                                       unsigned long reclaimed)
> +{
> +       long delta;
> +
> +       if (mem_cgroup_is_root(pn->memcg))
> +               return;
> +
> +       memcg_decay_heat(pn);
Could you combine the decay and adjust later in this function?

> +
> +       /*
> +        * Memory cgroup heat adjustment algorithm:
> +        * - If scanned == 0: mark as hottest (+MAX_HEAT)
> +        * - If reclaimed >= 50% * scanned: strong cool (-2)
> +        * - If reclaimed >= 25% * scanned: mild cool (-1)
> +        * - Otherwise:  warm up (+1)
> +        */
> +       if (!scanned)
> +               delta = MEMCG_HEAT_MAX;
> +       else if (reclaimed * 2 >= scanned)
> +               delta = -2;
> +       else if (reclaimed * 4 >= scanned)
> +               delta = -1;
> +       else
> +               delta = 1;
> +
> +       /*
> +        * Refault-based heat adjustment:
> +        * - If refault increase > reclaimed pages: heat up (more cautious reclaim)
> +        * - If no refaults and currently warm:     cool down (allow more reclaim)
> +        * This prevents thrashing by backing off when refaults indicate over-reclaim.
> +        */
> +       if (lruvec) {
> +               unsigned long total_refaults;
> +               unsigned long prev;
> +               long refault_delta;
> +
> +               total_refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_ANON);
> +               total_refaults += lruvec_page_state(lruvec, WORKINGSET_ACTIVATE_FILE);
> +
> +               prev = atomic_long_xchg(&pn->reclaim.last_refault, total_refaults);
> +               refault_delta = total_refaults - prev;
> +
> +               if (refault_delta > reclaimed)
> +                       delta++;
> +               else if (!refault_delta && delta > 0)
> +                       delta--;
> +       }

I think this metric is based more on the memcg's reclaimability than
on heat. Though the memcgs are grouped based on absolute metrics and
not relative to others.

> +
> +       memcg_adjust_heat(pn, delta);
> +}
> +
>  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  {
> ...snip
>  }

Thanks,
Yuanchu

next prev parent reply	other threads:[~2026-02-06 22:47 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-20 13:42 [RFC PATCH -next 0/7] Introduce heat-level memcg reclaim Chen Ridong
2026-01-20 13:42 ` [RFC PATCH -next 1/7] vmscan: add memcg heat level for reclaim Chen Ridong
2026-01-21  7:53   ` Chen Ridong
2026-01-21 14:58   ` Kairui Song
2026-01-22  2:32     ` Chen Ridong
2026-02-06 22:47   ` Yuanchu Xie [this message]
2026-02-09  8:17     ` Chen Ridong
2026-01-20 13:42 ` [RFC PATCH -next 2/7] mm/mglru: make calls to flush_reclaim_state() similar for MGLRU and non-MGLRU Chen Ridong
2026-01-20 13:42 ` [RFC PATCH -next 3/7] mm/mglru: rename should_abort_scan to lru_gen_should_abort_scan Chen Ridong
2026-01-20 13:42 ` [RFC PATCH -next 4/7] mm/mglru: extend lru_gen_shrink_lruvec to support root reclaim Chen Ridong
2026-01-20 13:42 ` [RFC PATCH -next 5/7] mm/mglru: combine shrink_many into shrink_node_memcgs Chen Ridong
2026-01-21  8:13   ` Chen Ridong
2026-01-20 13:42 ` [RFC PATCH -next 6/7] mm/mglru: remove memcg disable handling from lru_gen_shrink_node Chen Ridong
2026-01-20 13:42 ` [RFC PATCH -next 7/7] mm/mglru: remove memcg lru Chen Ridong
2026-01-29 11:25 ` [RFC PATCH -next 0/7] Introduce heat-level memcg reclaim Chen Ridong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJj2-QEvrgQ+R-nc3LZ-cBfnzjakxfSgmNbqDa-RFBVOpdVaAQ@mail.gmail.com \
    --to=yuanchu@google.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chenridong@huaweicloud.com \
    --cc=corbet@lwn.net \
    --cc=david@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=lujialin4@huawei.com \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=roman.gushchin@linux.dev \
    --cc=rppt@kernel.org \
    --cc=ryncsn@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=skhan@linuxfoundation.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=weixugc@google.com \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox