linux-mm.kvack.org archive mirror
From: Leon Huang Fu <leon.huangfu@shopee.com>
To: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, hannes@cmpxchg.org, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, muchun.song@linux.dev,
	akpm@linux-foundation.org, joel.granados@kernel.org,
	jack@suse.cz, laoar.shao@gmail.com, mclapinski@google.com,
	kyle.meyer@hpe.com, corbet@lwn.net, lance.yang@linux.dev,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org
Subject: Re: [PATCH mm-new] mm/memcontrol: Introduce sysctl vm.memcg_stats_flush_threshold
Date: Wed, 5 Nov 2025 14:01:33 +0800
Message-ID: <CAPV86rrt0YT-npNSBJ_eHvAYdr_j1qkN7H+J4QLN8zsfi5TJ4w@mail.gmail.com>
In-Reply-To: <aQnFn6vPQ5D6STGw@tiehlicka>

On Tue, Nov 4, 2025 at 5:21 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 04-11-25 11:19:08, Leon Huang Fu wrote:
> > The current implementation uses a flush threshold calculated as
> > MEMCG_CHARGE_BATCH * num_online_cpus() for determining when to
> > aggregate per-CPU memory cgroup statistics. On systems with high core
> > counts, this threshold can become very large (e.g., 64 * 256 = 16,384
> > on a 256-core system), leading to stale statistics when userspace reads
> > memory.stat files.
> >
> > This is particularly problematic for monitoring and management tools
> > that rely on reasonably fresh statistics, as they may observe data that
> > is thousands of updates out of date.
> >
> > Introduce a new sysctl, vm.memcg_stats_flush_threshold, that allows
> > administrators to override the flush threshold specifically for
> > userspace reads of memory.stat. When set to 0 (default), the behavior
> > remains unchanged, using the automatic calculation. When set to a
> > non-zero value, userspace reads will use the custom threshold for more
> > frequent flushing.
>
> How are admins supposed to know how to tune this? Wouldn't it make more
> sense to allow explicit flushing on write to the file? That would allow
> admins to implement their preferred accuracy tuning by writing to the file
> when the precision is required.

Thank you for the feedback. Let me clarify the use case and design rationale.

The threshold approach is intended for scenarios where administrators want to
improve accuracy for existing monitoring tools on high core-count systems. On
such systems, the default threshold (MEMCG_CHARGE_BATCH * num_online_cpus())
reaches 16,384 updates at 256 cores, so monitoring dashboards can display
stale data.
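
To make the staleness concrete, here is a toy Python model of
threshold-gated flushing. It is only a sketch, not the kernel
implementation: the class and method names are made up for illustration,
MEMCG_CHARGE_BATCH = 64 is taken from the 64 * 256 = 16,384 example quoted
above, and any periodic flushing the kernel does is deliberately left out.

```python
MEMCG_CHARGE_BATCH = 64  # assumed from the 64 * 256 = 16,384 example


class ToyMemcgStats:
    """Toy model: readers only see updates that have been flushed."""

    def __init__(self, num_cpus, read_threshold=0):
        # 0 means "use the automatic calculation", mirroring the proposed
        # sysctl's default.
        self.threshold = read_threshold or MEMCG_CHARGE_BATCH * num_cpus
        self.visible = 0  # aggregated value a reader would see
        self.pending = 0  # updates not yet aggregated

    def update(self, delta=1):
        self.pending += delta

    def read(self):
        # Flush only once more than `threshold` updates have accumulated.
        if self.pending > self.threshold:
            self.visible += self.pending
            self.pending = 0
        return self.visible


if __name__ == "__main__":
    for threshold in (0, 4096):
        stats = ToyMemcgStats(num_cpus=256, read_threshold=threshold)
        for _ in range(10_000):
            stats.update()
        print(f"threshold={stats.threshold}: reader sees {stats.read()}")
```

In this model, 10,000 pending updates stay invisible to a reader under the
default 256-core threshold, while a threshold of 4096 makes the read flush
them.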

Regarding tunability: picking an exact threshold value takes some
understanding, but the principle is straightforward: lower values mean
fresher stats at the cost of higher flush overhead. Administrators can start
conservatively (e.g., 1/4 of the default, i.e. num_cpus * 16) and adjust
based on observed overhead.
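
The starting-point arithmetic can be sketched in a few lines of Python. The
helper names are hypothetical, and MEMCG_CHARGE_BATCH = 64 is again assumed
from the example in the patch description.

```python
MEMCG_CHARGE_BATCH = 64  # assumed from the 64 * 256 = 16,384 example


def default_threshold(num_cpus):
    """Automatic threshold: MEMCG_CHARGE_BATCH * number of online CPUs."""
    return MEMCG_CHARGE_BATCH * num_cpus


def conservative_start(num_cpus):
    """Suggested starting point: 1/4 of the default, i.e. num_cpus * 16."""
    return default_threshold(num_cpus) // 4


if __name__ == "__main__":
    for cpus in (64, 128, 256):
        print(cpus, default_threshold(cpus), conservative_start(cpus))
```

On a 256-core machine this gives 16,384 for the default and 4,096 as the
starting value, which is what would be written to the proposed
vm.memcg_stats_flush_threshold sysctl.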

Your suggestion about allowing writes to memory.stat to trigger explicit
flushing is interesting. Comparing the two approaches:

- Threshold (this patch):
  - Administrator sets once system-wide via sysctl
  - Affects all memory.stat reads automatically
  - Tradeoff: harder to tune, always-on overhead

- Write-to-flush (your suggestion):
  - Tools write to memory.stat before reading: echo 1 > memory.stat
  - Per-cgroup, on-demand control
  - Tradeoff: requires tool modifications, but more precise control
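
From a monitoring tool's point of view, the write-to-flush pattern might
look roughly like the Python sketch below. This is an assumption about an
interface that is only being proposed in this thread: the function names are
hypothetical, and the flush step is passed in as a callable so the parsing
can be exercised without the proposed write support.

```python
from pathlib import Path


def write_to_flush(stat_path):
    """Hypothetical flush request, equivalent to: echo 1 > memory.stat"""
    Path(stat_path).write_text("1")


def read_memory_stat(stat_path, flush=write_to_flush):
    """Optionally request a flush, then parse 'key value' lines."""
    if flush is not None:
        flush(stat_path)
    stats = {}
    for line in Path(stat_path).read_text().splitlines():
        key, _, value = line.partition(" ")
        value = value.strip()
        if value.isdigit():  # skip anything that is not a flat counter
            stats[key] = int(value)
    return stats
```

A tool that only needs approximate numbers would pass flush=None and skip
the flush cost entirely, which is the per-read, on-demand control described
above.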

Actually, your approach may be more elegant - tools pay the flush cost only
when they need accuracy, rather than imposing a system-wide policy. The
write-to-flush pattern is also more discoverable and self-documenting.

Let me try your approach in the next revision.

Thanks,
Leon

>
> --
> Michal Hocko
> SUSE Labs


