From: Leon Huang Fu <leon.huangfu@shopee.com>
To: linux-mm@kvack.org
Cc: hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, muchun.song@linux.dev,
akpm@linux-foundation.org, joel.granados@kernel.org,
jack@suse.cz, laoar.shao@gmail.com, mclapinski@google.com,
kyle.meyer@hpe.com, corbet@lwn.net, lance.yang@linux.dev,
leon.huangfu@shopee.com, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
Date: Wed, 5 Nov 2025 15:49:16 +0800
Message-ID: <20251105074917.94531-1-leon.huangfu@shopee.com>

On high-core-count systems, memory cgroup statistics can become stale
due to per-CPU caching and deferred aggregation. Monitoring tools and
management applications sometimes need guaranteed up-to-date statistics
at specific points in time to make accurate decisions.

This patch adds write handlers to both memory.stat and memory.numa_stat
files to allow userspace to explicitly force an immediate flush of
memory statistics. When "1" is written to either file, the handler
calls css_rstat_flush(css), which unconditionally flushes all pending
statistics for the cgroup and its descendants.

The write operation validates the input and only accepts the value "1",
returning -EINVAL for any other input.

Usage example:

  # Force immediate flush before reading critical statistics
  echo 1 > /sys/fs/cgroup/mygroup/memory.stat
  cat /sys/fs/cgroup/mygroup/memory.stat
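
For tools that do not go through a shell, the same flush-then-read
sequence could look roughly like the userspace sketch below. It assumes
a kernel with this patch applied; on other kernels the open or the write
is expected to fail, so callers should treat the flush as best effort.
The helper names memcg_request_flush() and memcg_read_stat() are
hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write "1" to a stat file such as <cgroup>/memory.stat to request a flush.
 * Returns 0 on success or a negative errno value.
 * Assumption: the write interface proposed in this patch is present.
 */
int memcg_request_flush(const char *stat_path)
{
	int fd = open(stat_path, O_WRONLY);
	ssize_t n;
	int ret = 0;

	if (fd < 0)
		return -errno;

	/* Per the patch, any value other than 1 is rejected with -EINVAL. */
	n = write(fd, "1", 1);
	if (n < 0)
		ret = -errno;
	else if (n != 1)
		ret = -EIO;

	close(fd);
	return ret;
}

/*
 * Read up to len - 1 bytes from the stat file with a single read() and
 * NUL-terminate the buffer. Returns the byte count or a negative errno.
 */
ssize_t memcg_read_stat(const char *stat_path, char *buf, size_t len)
{
	int fd;
	ssize_t n;

	if (len == 0)
		return -EINVAL;

	fd = open(stat_path, O_RDONLY);
	if (fd < 0)
		return -errno;

	n = read(fd, buf, len - 1);
	if (n < 0)
		n = -errno;
	else
		buf[n] = '\0';

	close(fd);
	return n;
}
```

A caller might pass "/sys/fs/cgroup/mygroup/memory.stat" to
memcg_request_flush() and then to memcg_read_stat(), falling back to a
plain read when the flush request returns an error.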

This provides several benefits:

1. On-demand accuracy: Tools can flush only when needed, avoiding
   continuous overhead

2. Targeted flushing: Allows flushing specific cgroups when precision
   is required for particular workloads

3. Integration flexibility: Monitoring scripts can decide when to pay
   the flush cost based on their specific accuracy requirements

The implementation is shared between cgroup v1 and v2 interfaces,
with memory_stat_write() providing the common validation and flush
logic. Both memory.stat and memory.numa_stat use the same write
handler since they both benefit from forcing accurate statistics.

Documentation is updated to reflect that these files are now read-write
instead of read-only, with a clear explanation of the write behavior.

Signed-off-by: Leon Huang Fu <leon.huangfu@shopee.com>
---
v1 -> v2:
- Flush stats when the file is written (per Michal).
- https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/
Documentation/admin-guide/cgroup-v2.rst | 31 +++++++++++++++++--------
mm/memcontrol-v1.c | 2 ++
mm/memcontrol-v1.h | 1 +
mm/memcontrol.c | 13 +++++++++++
4 files changed, 37 insertions(+), 10 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3345961c30ac..2a4a81d2cc2f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
cgroup is within its effective low boundary, the cgroup's
memory won't be reclaimed unless there is no reclaimable
memory available in unprotected cgroups.
- Above the effective low boundary (or
+ Above the effective low boundary (or
effective min boundary if it is higher), pages are reclaimed
proportionally to the overage, reducing reclaim pressure for
smaller overages.
@@ -1525,11 +1525,17 @@ The following nested keys are defined.
generated on this file reflects only the local events.
memory.stat
- A read-only flat-keyed file which exists on non-root cgroups.
+ A read-write flat-keyed file which exists on non-root cgroups.
- This breaks down the cgroup's memory footprint into different
- types of memory, type-specific details, and other information
- on the state and past events of the memory management system.
+ Reading this file breaks down the cgroup's memory footprint into
+ different types of memory, type-specific details, and other
+ information on the state and past events of the memory management
+ system.
+
+ Writing the value "1" to this file forces an immediate flush of
+ memory statistics for this cgroup and its descendants, improving
+ the accuracy of subsequent reads. Any other value will result in
+ an error.
All memory amounts are in bytes.
@@ -1786,11 +1792,16 @@ The following nested keys are defined.
cgroup is mounted with the memory_hugetlb_accounting option).
memory.numa_stat
- A read-only nested-keyed file which exists on non-root cgroups.
+ A read-write nested-keyed file which exists on non-root cgroups.
+
+ Reading this file breaks down the cgroup's memory footprint into
+ different types of memory, type-specific details, and other
+ information per node on the state of the memory management system.
- This breaks down the cgroup's memory footprint into different
- types of memory, type-specific details, and other information
- per node on the state of the memory management system.
+ Writing the value "1" to this file forces an immediate flush of
+ memory statistics for this cgroup and its descendants, improving
+ the accuracy of subsequent reads. Any other value will result in
+ an error.
This is useful for providing visibility into the NUMA locality
information within an memcg since the pages are allowed to be
@@ -2173,7 +2184,7 @@ of the two is enforced.
cgroup writeback requires explicit support from the underlying
filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
attributed to the root cgroup.
There are inherent differences in memory and writeback management
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..8cab6b52424b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -2040,6 +2040,7 @@ struct cftype mem_cgroup_legacy_files[] = {
{
.name = "stat",
.seq_show = memory_stat_show,
+ .write_u64 = memory_stat_write,
},
{
.name = "force_empty",
@@ -2078,6 +2079,7 @@ struct cftype mem_cgroup_legacy_files[] = {
{
.name = "numa_stat",
.seq_show = memcg_numa_stat_show,
+ .write_u64 = memory_stat_write,
},
#endif
{
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 6358464bb416..1c92d58330aa 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -29,6 +29,7 @@ void drain_all_stock(struct mem_cgroup *root_memcg);
unsigned long memcg_events(struct mem_cgroup *memcg, int event);
unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
int memory_stat_show(struct seq_file *m, void *v);
+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val);
void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c34029e92bab..d6a5d872fbcb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
return 0;
}
+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+ if (val != 1)
+ return -EINVAL;
+
+ if (css)
+ css_rstat_flush(css);
+
+ return 0;
+}
+
#ifdef CONFIG_NUMA
static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
int item)
@@ -4666,11 +4677,13 @@ static struct cftype memory_files[] = {
{
.name = "stat",
.seq_show = memory_stat_show,
+ .write_u64 = memory_stat_write,
},
#ifdef CONFIG_NUMA
{
.name = "numa_stat",
.seq_show = memory_numa_stat_show,
+ .write_u64 = memory_stat_write,
},
#endif
{
--
2.51.2