From: Shakeel Butt <shakeel.butt@linux.dev>
To: Yosry Ahmed <yosry@kernel.org>
Cc: Qi Zheng <qi.zheng@linux.dev>,
	hannes@cmpxchg.org, hughd@google.com,  mhocko@suse.com,
	roman.gushchin@linux.dev, muchun.song@linux.dev,
	 david@kernel.org, lorenzo.stoakes@oracle.com, ziy@nvidia.com,
	harry.yoo@oracle.com,  yosry.ahmed@linux.dev,
	imran.f.khan@oracle.com, kamalesh.babulal@oracle.com,
	 axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com,  chenridong@huaweicloud.com,
	mkoutny@suse.com, akpm@linux-foundation.org,
	 hamzamahfooz@linux.microsoft.com, apais@linux.microsoft.com,
	lance.yang@linux.dev, bhe@redhat.com,  usamaarif642@gmail.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 cgroups@vger.kernel.org, Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [PATCH v5 29/32] mm: memcontrol: prepare for reparenting non-hierarchical stats
Date: Thu, 26 Feb 2026 09:02:40 -0800	[thread overview]
Message-ID: <aaB7yYSpAaC5uInq@linux.dev> (raw)
In-Reply-To: <CAO9r8zPmgytmGHAbueFKXcZWY5SJaEwD3Pqk99ws4XeO2_hnKw@mail.gmail.com>

On Thu, Feb 26, 2026 at 07:16:50AM -0800, Yosry Ahmed wrote:
> > > Did you measure the impact of making state_local atomic on the flush
> > > path? It's a slow path but we've seen pain from it being too slow
> > > before, because it extends the critical section of the rstat flush
> > > lock.
> >
> > Qi, please measure the impact on flushing; if there is no impact then there is
> > no need to do anything, as I don't want any more churn in this series.
> >
> > >
> > > Can we keep this non-atomic and use mod_memcg_lruvec_state() here? It
> > > will update the stat on the local counter and it will be added to
> > > state_local in the flush path when needed. We can even force another
> > > flush in reparent_state_local() after reparenting is completed, if we
> > > want to avoid leaving a potentially large stat update pending, as it
> > > can be missed by mem_cgroup_flush_stats_ratelimited().
> > >
> > > Same for reparent_memcg_state_local(), we can probably use mod_memcg_state()?
> >
> > Yosry, do you mind sending the patch you are thinking about on top of this series?
> 
> Honestly, I'd rather squash it into this patch if possible. It avoids
> churn in the history (switch to atomics and back), and is arguably
> simpler than checking for regressions in the flush path.

Yup, let's squash it into the original patch. Please add your sign-off tag.

> 
> What I have in mind is the diff below (build tested only). Qi, would
> you be able to test this? It applies directly on this patch in mm-new:

Qi, please squash this diff into the patch and test. You might need to adjust
the subsequent patches; once you are done with testing, you can post the diffs
for those in reply to the corresponding patches and we will ask Andrew to
squash them into the original ones.

The diff looks good to me though.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d82dbfcc28057..404565e80cbf3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -234,11 +234,18 @@ static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>         if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>                 return;
> 
> +       /*
> +        * Reparent stats exposed non-hierarchically. Flush @memcg's stats first
> +        * to read its stats accurately, and conservatively flush @parent's stats
> +        * after reparenting to avoid hiding a potentially large stat update
> +        * (e.g. from callers of mem_cgroup_flush_stats_ratelimited()).
> +        */
>         __mem_cgroup_flush_stats(memcg, true);
> 
> -       /* The following counts are all non-hierarchical and need to be reparented. */
>         reparent_memcg1_state_local(memcg, parent);
>         reparent_memcg1_lruvec_state_local(memcg, parent);
> +
> +       __mem_cgroup_flush_stats(parent, true);
>  }
>  #else
>  static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> @@ -442,7 +449,7 @@ struct lruvec_stats {
>         long state[NR_MEMCG_NODE_STAT_ITEMS];
> 
>         /* Non-hierarchical (CPU aggregated) state */
> -       atomic_long_t state_local[NR_MEMCG_NODE_STAT_ITEMS];
> +       long state_local[NR_MEMCG_NODE_STAT_ITEMS];
> 
>         /* Pending child counts during tree propagation */
>         long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
> @@ -485,7 +492,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>                 return 0;
> 
>         pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> -       x = atomic_long_read(&(pn->lruvec_stats->state_local[i]));
> +       x = READ_ONCE(pn->lruvec_stats->state_local[i]);
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -493,6 +500,10 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>         return x;
>  }
> 
> +static void mod_memcg_lruvec_state(struct lruvec *lruvec,
> +                                  enum node_stat_item idx,
> +                                  int val);
> +
>  #ifdef CONFIG_MEMCG_V1
>  void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
>                                        struct mem_cgroup *parent, int idx)
> @@ -506,12 +517,10 @@ void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
>         for_each_node(nid) {
>                 struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>                 struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
> -               struct mem_cgroup_per_node *parent_pn;
>                 unsigned long value = lruvec_page_state_local(child_lruvec, idx);
> 
> -               parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec);
> -
> -               atomic_long_add(value, &(parent_pn->lruvec_stats->state_local[i]));
> +               mod_memcg_lruvec_state(child_lruvec, idx, -value);
> +               mod_memcg_lruvec_state(parent_lruvec, idx, value);
>         }
>  }
>  #endif
> @@ -598,7 +607,7 @@ struct memcg_vmstats {
>         unsigned long           events[NR_MEMCG_EVENTS];
> 
>         /* Non-hierarchical (CPU aggregated) page state & events */
> -       atomic_long_t           state_local[MEMCG_VMSTAT_SIZE];
> +       long                    state_local[MEMCG_VMSTAT_SIZE];
>         unsigned long           events_local[NR_MEMCG_EVENTS];
> 
>         /* Pending child counts during tree propagation */
> @@ -835,7 +844,7 @@ unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
>         if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
>                 return 0;
> 
> -       x = atomic_long_read(&(memcg->vmstats->state_local[i]));
> +       x = READ_ONCE(memcg->vmstats->state_local[i]);
>  #ifdef CONFIG_SMP
>         if (x < 0)
>                 x = 0;
> @@ -852,7 +861,8 @@ void reparent_memcg_state_local(struct mem_cgroup *memcg,
>         if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
>                 return;
> 
> -       atomic_long_add(value, &(parent->vmstats->state_local[i]));
> +       mod_memcg_state(memcg, idx, -value);
> +       mod_memcg_state(parent, idx, value);
>  }
>  #endif
> 
> @@ -4174,8 +4184,6 @@ struct aggregate_control {
>         long *aggregate;
>         /* pointer to the non-hierarchichal (CPU aggregated) counters */
>         long *local;
> -       /* pointer to the atomic non-hierarchichal (CPU aggregated) counters */
> -       atomic_long_t *alocal;
>         /* pointer to the pending child counters during tree propagation */
>         long *pending;
>         /* pointer to the parent's pending counters, could be NULL */
> @@ -4213,12 +4221,8 @@ static void mem_cgroup_stat_aggregate(struct aggregate_control *ac)
>                 }
> 
>                 /* Aggregate counts on this level and propagate upwards */
> -               if (delta_cpu) {
> -                       if (ac->local)
> -                               ac->local[i] += delta_cpu;
> -                       else if (ac->alocal)
> -                               atomic_long_add(delta_cpu, &(ac->alocal[i]));
> -               }
> +               if (delta_cpu)
> +                       ac->local[i] += delta_cpu;
> 
>                 if (delta) {
>                         ac->aggregate[i] += delta;
> @@ -4289,8 +4293,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> 
>         ac = (struct aggregate_control) {
>                 .aggregate = memcg->vmstats->state,
> -               .local = NULL,
> -               .alocal = memcg->vmstats->state_local,
> +               .local = memcg->vmstats->state_local,
>                 .pending = memcg->vmstats->state_pending,
>                 .ppending = parent ? parent->vmstats->state_pending : NULL,
>                 .cstat = statc->state,
> @@ -4323,8 +4326,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> 
>                 ac = (struct aggregate_control) {
>                         .aggregate = lstats->state,
> -                       .local = NULL,
> -                       .alocal = lstats->state_local,
> +                       .local = lstats->state_local,
>                         .pending = lstats->state_pending,
>                         .ppending = plstats ? plstats->state_pending : NULL,
>                         .cstat = lstatc->state,

