From: Shakeel Butt <shakeel.butt@linux.dev>
To: "Chen, Yu C" <yu.c.chen@intel.com>
Cc: "Michal Koutný" <mkoutny@suse.com>,
peterz@infradead.org, akpm@linux-foundation.org,
mingo@redhat.com, tj@kernel.org, hannes@cmpxchg.org,
corbet@lwn.net, mgorman@suse.de, mhocko@kernel.org,
muchun.song@linux.dev, roman.gushchin@linux.dev,
tim.c.chen@intel.com, aubrey.li@intel.com, libo.chen@oracle.com,
kprateek.nayak@amd.com, vineethr@linux.ibm.com,
venkat88@linux.ibm.com, ayushjai@amd.com,
cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
yu.chen.surf@foxmail.com
Subject: Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
Date: Tue, 27 May 2025 11:15:33 -0700
Message-ID: <fpa42ohp54ewxxymaclnmiafdlfs7lbddnqhtv7haksdd5jq6z@mb6jxk3pl2m2>
In-Reply-To: <a8314889-f036-49ff-9cda-01367ddccf51@intel.com>
On Tue, May 27, 2025 at 05:20:54PM +0800, Chen, Yu C wrote:
> On 5/26/2025 9:35 PM, Michal Koutný wrote:
> > On Fri, May 23, 2025 at 04:42:50PM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > Hmm these are scheduler events, how are these relevant to memory cgroup
> > > or vmstat? Any reason to not expose these in cpu.stat?
> >
> > Good point. If I take it further -- this functionality needs neither
> > memory controller (CONFIG_MEMCG) nor CPU controller
> > (CONFIG_CGROUP_SCHED), so it might be technically calculated and exposed
> > in _any_ cgroup (which would be same technical solution how cpu time is
> > counted in cpu.stat regardless of CPU controller, cpu_stat_show()).
> >
>
> Yes, we can add it to cpu.stat. However, this might make it more difficult
> for users to locate related events. Some statistics about NUMA page
> migrations/faults are recorded in memory.stat, while others about NUMA task
> migrations (triggered periodically by NUMA faults) are stored in cpu.stat.
>
> Do you recommend extending the struct cgroup_base_stat to include counters
> for task_migrate/task_swap? Additionally, should we enhance
> cgroup_base_stat_cputime_show() to parse task_migrate/task_swap in a manner
> similar to cputime?
>
> Alternatively, as Shakeel previously mentioned, could we reuse
> "count_memcg_event_mm()" and related infrastructure while exposing these
> statistics/events in cpu.stat? I assume Shakeel was referring to the
> following approach:
>
> 1. Skip task migration/swap in memory.stat:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cdaab8a957f3..b8eea3eca46f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1529,6 +1529,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> if (memcg_vm_event_stat[i] == PGPGIN ||
> memcg_vm_event_stat[i] == PGPGOUT)
> continue;
> +#endif
> +#ifdef CONFIG_NUMA_BALANCING
> + if (memcg_vm_event_stat[i] == NUMA_TASK_MIGRATE ||
> + memcg_vm_event_stat[i] == NUMA_TASK_SWAP)
> + continue;
> #endif
>
> 2. Skip task migration/swap in /proc/vmstat:
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ed08bb384ae4..ea8a8ae1cdac 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1912,6 +1912,10 @@ static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
> (*pos)++;
> if (*pos >= NR_VMSTAT_ITEMS)
> return NULL;
> +#ifdef CONFIG_NUMA_BALANCING
> + if (*pos == NUMA_TASK_MIGRATE || *pos == NUMA_TASK_SWAP)
> + return NULL;
> +#endif
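One note on diff 2 above: returning NULL from vmstat_next() ends the whole walk, so every item after the skipped ones would disappear from /proc/vmstat, and *pos there indexes the concatenated vmstat item space rather than the vm_event enum directly. A skip would look more like the following untested sketch, where VMSTAT_EVENT_OFF stands for the (hypothetical) offset of the vm events within that space:

+#ifdef CONFIG_NUMA_BALANCING
+	/* Skip the two task-migration items instead of ending the walk. */
+	while (*pos == VMSTAT_EVENT_OFF + NUMA_TASK_MIGRATE ||
+	       *pos == VMSTAT_EVENT_OFF + NUMA_TASK_SWAP)
+		(*pos)++;
+	if (*pos >= NR_VMSTAT_ITEMS)
+		return NULL;
+#endif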
>
> 3. Display task migration/swap events in cpu.stat:
>	seq_buf_printf(&s, "%s %lu\n",
> +		       vm_event_name(NUMA_TASK_MIGRATE),
> +		       memcg_events(memcg, NUMA_TASK_MIGRATE));
>
You would need to use memcg_events(), and you would need to flush the
memcg rstat trees as well.
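Concretely, an untested sketch of what the cpu.stat side could look like, assuming memcg_events() (currently static to mm/memcontrol.c) were made reachable from there, and noting that the exact flush helper names have changed across kernel versions:

	/* Sketch only: flush so the per-CPU deltas are folded into the
	 * counters before we read them. */
	mem_cgroup_flush_stats(memcg);

	seq_buf_printf(&s, "%s %lu\n", vm_event_name(NUMA_TASK_MIGRATE),
		       memcg_events(memcg, NUMA_TASK_MIGRATE));
	seq_buf_printf(&s, "%s %lu\n", vm_event_name(NUMA_TASK_SWAP),
		       memcg_events(memcg, NUMA_TASK_SWAP));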
>
> It looks like more code is needed. Michal, Shakeel, could you please advise
> which strategy is preferred, or should we keep the current version?
I am now more inclined to keep these new stats in memory.stat, as the
current version does, because:

1. Relevant stats are exposed through the same interface, and we already
   have NUMA balancing stats in memory.stat.
2. There is no single good home for these new stats, and exposing them in
   cpu.stat would require more code. Even if we reuse the memcg
   infrastructure, we would still need to flush the memcg stats, so why
   not just expose them in memory.stat.
3. Though a bit far-fetched, I think we may add more stats that sit at
   the boundary of sched and mm in the future; NUMA balancing is one
   concrete example. For reliable memory reclaim or overcommit, I can
   envision some useful events as well, though that is still unbaked at
   the moment.

Michal, let me know your thoughts on this.
Thread overview: 21+ messages
2025-05-23 12:48 [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Chen Yu
2025-05-23 12:51 ` [PATCH v5 1/2] sched/numa: fix task swap by skipping kernel threads Chen Yu
2025-05-23 23:22 ` Shakeel Butt
2025-05-23 12:51 ` [PATCH v5 2/2] sched/numa: add statistics of numa balance task Chen Yu
2025-05-23 23:42 ` Shakeel Butt
2025-05-24 9:07 ` Chen, Yu C
2025-05-24 17:32 ` Shakeel Butt
2025-05-25 12:35 ` Chen, Yu C
2025-05-27 17:48 ` Shakeel Butt
2025-05-29 5:04 ` Chen, Yu C
2025-05-26 13:35 ` Michal Koutný
2025-05-27 9:20 ` Chen, Yu C
2025-05-27 18:15 ` Shakeel Butt [this message]
2025-06-02 16:53 ` Michal Koutný
2025-06-03 14:46 ` Chen, Yu C
2025-06-17 9:30 ` Michal Koutný
2025-06-19 13:03 ` Chen, Yu C
2025-06-19 14:06 ` Michal Koutný
2025-05-23 22:06 ` [PATCH v5 0/2] sched/numa: add statistics of numa balance task migration Andrew Morton
2025-05-23 23:52 ` Shakeel Butt
2025-05-28 0:21 ` Andrew Morton