* [PATCH 0/2] improve per-node allocation and reclaim visibility
@ 2026-02-12 4:51 JP Kobryn
2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: JP Kobryn @ 2026-02-12 4:51 UTC (permalink / raw)
To: linux-mm
Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david,
eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt,
muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
ying.huang, yuanchu, ziy, kernel-team
We sometimes find ourselves in situations where reclaim kicks in, yet there
is free memory available on the system. One possible explanation is that a
NUMA node under pressure has triggered the reclaim. This NUMA imbalance
scenario could be made easier to diagnose if we had better visibility.
This series aims to provide that visibility by accounting for both the cause
and the effect of the imbalance. First, new node stats track allocations on a
per-policy basis. If a node is under pressure, these stats can help reveal how
it got there.
Second, the stats associated with reclaim are changed from vm_event_item to
node_stat_item. Having the pgsteal and pgscan counters tracked on a
per-node basis reveals the effect of any pressure, and allows us to quickly
narrow down the affected node(s).
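As a rough sketch of the intended readout (assuming the counter names follow
the vmstat_text strings added in these patches, e.g. pgalloc_mpol_bind and
pgsteal_kswapd, and that they are exposed per node via
/sys/devices/system/node/nodeN/vmstat), a small userspace helper could dump
the relevant counters for one node:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[128];
	int nid = argc > 1 ? atoi(argv[1]) : 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/vmstat", nid);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}

	/* Print only the reclaim and mempolicy allocation counters. */
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "pgscan_", 7) ||
		    !strncmp(line, "pgsteal_", 8) ||
		    !strncmp(line, "pgalloc_mpol_", 13))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

Comparing this output across nodes points at the node(s) doing the reclaim and
the policies driving allocations there.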
JP Kobryn (2):
mm/mempolicy: track page allocations per mempolicy
mm: move pgscan and pgsteal to node stats
drivers/virtio/virtio_balloon.c | 8 ++++----
include/linux/mmzone.h | 21 +++++++++++++++++++
include/linux/vm_event_item.h | 12 -----------
mm/memcontrol.c | 36 ++++++++++++++++++---------------
mm/mempolicy.c | 30 +++++++++++++++++++++++++--
mm/vmscan.c | 32 +++++++++++------------------
mm/vmstat.c | 33 +++++++++++++++++++-----------
7 files changed, 106 insertions(+), 66 deletions(-)
--
2.47.3
^ permalink raw reply [flat|nested] 27+ messages in thread* [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn @ 2026-02-12 4:51 ` JP Kobryn 2026-02-12 7:29 ` Michal Hocko ` (3 more replies) 2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn ` (2 subsequent siblings) 3 siblings, 4 replies; 27+ messages in thread From: JP Kobryn @ 2026-02-12 4:51 UTC (permalink / raw) To: linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team It would be useful to see a breakdown of allocations to understand which NUMA policies are driving them. For example, when investigating memory pressure, having policy-specific counts could show that allocations were bound to the affected node (via MPOL_BIND). Add per-policy page allocation counters as new node stat items. These counters can provide correlation between a mempolicy and pressure on a given node. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> --- include/linux/mmzone.h | 9 +++++++++ mm/mempolicy.c | 30 ++++++++++++++++++++++++++++-- mm/vmstat.c | 9 +++++++++ 3 files changed, 46 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fc5d6c88d2f0..762609d5f0af 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -255,6 +255,15 @@ enum node_stat_item { PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, PGDEMOTE_PROACTIVE, +#ifdef CONFIG_NUMA + PGALLOC_MPOL_DEFAULT, + PGALLOC_MPOL_PREFERRED, + PGALLOC_MPOL_BIND, + PGALLOC_MPOL_INTERLEAVE, + PGALLOC_MPOL_LOCAL, + PGALLOC_MPOL_PREFERRED_MANY, + PGALLOC_MPOL_WEIGHTED_INTERLEAVE, +#endif #ifdef CONFIG_HUGETLB_PAGE NR_HUGETLB, #endif diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 68a98ba57882..3c64784af761 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -217,6 +217,21 @@ static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw) new_iw[nid] /= iw_gcd; } +#define CHECK_MPOL_NODE_STAT_OFFSET(mpol) \ + BUILD_BUG_ON(PGALLOC_##mpol - mpol != PGALLOC_MPOL_DEFAULT) + +static enum node_stat_item mpol_node_stat(unsigned short mode) +{ + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_PREFERRED); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_BIND); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_INTERLEAVE); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_LOCAL); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_PREFERRED_MANY); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_WEIGHTED_INTERLEAVE); + + return PGALLOC_MPOL_DEFAULT + mode; +} + int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) { struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; @@ -2446,8 +2461,14 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, nodemask = policy_nodemask(gfp, pol, ilx, &nid); - if (pol->mode == MPOL_PREFERRED_MANY) - return alloc_pages_preferred_many(gfp, order, nid, nodemask); + if (pol->mode == MPOL_PREFERRED_MANY) { + page = alloc_pages_preferred_many(gfp, order, nid, nodemask); + if (page) + __mod_node_page_state(page_pgdat(page), + mpol_node_stat(MPOL_PREFERRED_MANY), 1 << order); + + return page; + } if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && /* filter "hugepage" allocation, unless from alloc_pages() 
*/ @@ -2472,6 +2493,9 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, page = __alloc_frozen_pages_noprof( gfp | __GFP_THISNODE | __GFP_NORETRY, order, nid, NULL); + if (page) + __mod_node_page_state(page_pgdat(page), + mpol_node_stat(pol->mode), 1 << order); if (page || !(gfp & __GFP_DIRECT_RECLAIM)) return page; /* @@ -2484,6 +2508,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, } page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask); + if (page) + __mod_node_page_state(page_pgdat(page), mpol_node_stat(pol->mode), 1 << order); if (unlikely(pol->mode == MPOL_INTERLEAVE || pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) { diff --git a/mm/vmstat.c b/mm/vmstat.c index 65de88cdf40e..74e0ddde1e93 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1291,6 +1291,15 @@ const char * const vmstat_text[] = { [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", +#ifdef CONFIG_NUMA + [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", + [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", + [I(PGALLOC_MPOL_BIND)] = "pgalloc_mpol_bind", + [I(PGALLOC_MPOL_INTERLEAVE)] = "pgalloc_mpol_interleave", + [I(PGALLOC_MPOL_LOCAL)] = "pgalloc_mpol_local", + [I(PGALLOC_MPOL_PREFERRED_MANY)] = "pgalloc_mpol_preferred_many", + [I(PGALLOC_MPOL_WEIGHTED_INTERLEAVE)] = "pgalloc_mpol_weighted_interleave", +#endif #ifdef CONFIG_HUGETLB_PAGE [I(NR_HUGETLB)] = "nr_hugetlb", #endif -- 2.47.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
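For reference, the index mapping in mpol_node_stat() only works because the
PGALLOC_MPOL_* stat items are declared in the same order as the MPOL_* modes;
the CHECK_MPOL_NODE_STAT_OFFSET() checks enforce that at build time. A
standalone sketch of the same trick, with illustrative enum values rather than
the kernel's:

#include <assert.h>

enum mode { MODE_DEFAULT, MODE_PREFERRED, MODE_BIND };
enum stat { STAT_OTHER, STAT_MODE_DEFAULT, STAT_MODE_PREFERRED, STAT_MODE_BIND };

static enum stat mode_stat(enum mode m)
{
	/* Compile-time guard: the stat items must mirror the mode order. */
	_Static_assert(STAT_MODE_BIND - MODE_BIND == STAT_MODE_DEFAULT,
		       "stat items must mirror mode order");
	return STAT_MODE_DEFAULT + m;
}

int main(void)
{
	assert(mode_stat(MODE_BIND) == STAT_MODE_BIND);
	return 0;
}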
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn @ 2026-02-12 7:29 ` Michal Hocko 2026-02-12 21:22 ` JP Kobryn 2026-02-12 15:07 ` Shakeel Butt ` (2 subsequent siblings) 3 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-12 7:29 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Wed 11-02-26 20:51:08, JP Kobryn wrote: > It would be useful to see a breakdown of allocations to understand which > NUMA policies are driving them. For example, when investigating memory > pressure, having policy-specific counts could show that allocations were > bound to the affected node (via MPOL_BIND). > > Add per-policy page allocation counters as new node stat items. These > counters can provide correlation between a mempolicy and pressure on a > given node. Could you be more specific how exactly do you plan to use those counters? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 7:29 ` Michal Hocko @ 2026-02-12 21:22 ` JP Kobryn 2026-02-16 8:26 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn @ 2026-02-12 21:22 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/11/26 11:29 PM, Michal Hocko wrote: > On Wed 11-02-26 20:51:08, JP Kobryn wrote: >> It would be useful to see a breakdown of allocations to understand which >> NUMA policies are driving them. For example, when investigating memory >> pressure, having policy-specific counts could show that allocations were >> bound to the affected node (via MPOL_BIND). >> >> Add per-policy page allocation counters as new node stat items. These >> counters can provide correlation between a mempolicy and pressure on a >> given node. > > Could you be more specific how exactly do you plan to use those > counters? Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once we identify the affected node(s), the new mpol counters (this patch) allow us correlate the pressure to the mempolicy driving it. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 21:22 ` JP Kobryn @ 2026-02-16 8:26 ` Michal Hocko 2026-02-16 17:50 ` JP Kobryn (Meta) 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-16 8:26 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Thu 12-02-26 13:22:56, JP Kobryn wrote: > On 2/11/26 11:29 PM, Michal Hocko wrote: > > On Wed 11-02-26 20:51:08, JP Kobryn wrote: > > > It would be useful to see a breakdown of allocations to understand which > > > NUMA policies are driving them. For example, when investigating memory > > > pressure, having policy-specific counts could show that allocations were > > > bound to the affected node (via MPOL_BIND). > > > > > > Add per-policy page allocation counters as new node stat items. These > > > counters can provide correlation between a mempolicy and pressure on a > > > given node. > > > > Could you be more specific how exactly do you plan to use those > > counters? > > Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once > we identify the affected node(s), the new mpol counters (this patch) > allow us correlate the pressure to the mempolicy driving it. I would appreciate somehow more specificity. You are adding counters that are not really easy to drop once they are in. Sure we have precedence of dropping some counters in the past so this is not as hard as usual userspace APIs but still... How exactly do you tolerate mempolicy allocations to specific nodes? While MPOL_MBIND is quite straightforward others are less so. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-16 8:26 ` Michal Hocko @ 2026-02-16 17:50 ` JP Kobryn (Meta) 2026-02-16 21:07 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn (Meta) @ 2026-02-16 17:50 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/16/26 12:26 AM, Michal Hocko wrote: > On Thu 12-02-26 13:22:56, JP Kobryn wrote: >> On 2/11/26 11:29 PM, Michal Hocko wrote: >>> On Wed 11-02-26 20:51:08, JP Kobryn wrote: >>>> It would be useful to see a breakdown of allocations to understand which >>>> NUMA policies are driving them. For example, when investigating memory >>>> pressure, having policy-specific counts could show that allocations were >>>> bound to the affected node (via MPOL_BIND). >>>> >>>> Add per-policy page allocation counters as new node stat items. These >>>> counters can provide correlation between a mempolicy and pressure on a >>>> given node. >>> >>> Could you be more specific how exactly do you plan to use those >>> counters? >> >> Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once >> we identify the affected node(s), the new mpol counters (this patch) >> allow us correlate the pressure to the mempolicy driving it. > > I would appreciate somehow more specificity. You are adding counters > that are not really easy to drop once they are in. Sure we have > precedence of dropping some counters in the past so this is not as hard > as usual userspace APIs but still... > > How exactly do you tolerate mempolicy allocations to specific nodes? > While MPOL_MBIND is quite straightforward others are less so. The design does account for this regardless of the policy. In the call to __mod_node_page_state(), I'm using page_pgdat(page) so the stat is attributed to the node where the page actually landed. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-16 17:50 ` JP Kobryn (Meta) @ 2026-02-16 21:07 ` Michal Hocko 2026-02-17 7:48 ` JP Kobryn (Meta) 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-16 21:07 UTC (permalink / raw) To: JP Kobryn (Meta) Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Mon 16-02-26 09:50:26, JP Kobryn (Meta) wrote: > On 2/16/26 12:26 AM, Michal Hocko wrote: > > On Thu 12-02-26 13:22:56, JP Kobryn wrote: > > > On 2/11/26 11:29 PM, Michal Hocko wrote: > > > > On Wed 11-02-26 20:51:08, JP Kobryn wrote: > > > > > It would be useful to see a breakdown of allocations to understand which > > > > > NUMA policies are driving them. For example, when investigating memory > > > > > pressure, having policy-specific counts could show that allocations were > > > > > bound to the affected node (via MPOL_BIND). > > > > > > > > > > Add per-policy page allocation counters as new node stat items. These > > > > > counters can provide correlation between a mempolicy and pressure on a > > > > > given node. > > > > > > > > Could you be more specific how exactly do you plan to use those > > > > counters? > > > > > > Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once > > > we identify the affected node(s), the new mpol counters (this patch) > > > allow us correlate the pressure to the mempolicy driving it. > > > > I would appreciate somehow more specificity. You are adding counters > > that are not really easy to drop once they are in. Sure we have > > precedence of dropping some counters in the past so this is not as hard > > as usual userspace APIs but still... > > > > How exactly do you tolerate mempolicy allocations to specific nodes? > > While MPOL_MBIND is quite straightforward others are less so. > > The design does account for this regardless of the policy. In the call > to __mod_node_page_state(), I'm using page_pgdat(page) so the stat is > attributed to the node where the page actually landed. That much is clear[*]. The consumer side of things is not really clear to me. How do you know which policy or part of the nodemask of that policy is the source of the memory pressure on a particular node? In other words how much is the data actually useful except for a single node mempolicy (i.e. MBIND). [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the target node even when the allocation is from a remote node from the "local" POV. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-16 21:07 ` Michal Hocko @ 2026-02-17 7:48 ` JP Kobryn (Meta) 2026-02-17 12:37 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn (Meta) @ 2026-02-17 7:48 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/16/26 1:07 PM, Michal Hocko wrote: > On Mon 16-02-26 09:50:26, JP Kobryn (Meta) wrote: >> On 2/16/26 12:26 AM, Michal Hocko wrote: >>> On Thu 12-02-26 13:22:56, JP Kobryn wrote: >>>> On 2/11/26 11:29 PM, Michal Hocko wrote: >>>>> On Wed 11-02-26 20:51:08, JP Kobryn wrote: >>>>>> It would be useful to see a breakdown of allocations to understand which >>>>>> NUMA policies are driving them. For example, when investigating memory >>>>>> pressure, having policy-specific counts could show that allocations were >>>>>> bound to the affected node (via MPOL_BIND). >>>>>> >>>>>> Add per-policy page allocation counters as new node stat items. These >>>>>> counters can provide correlation between a mempolicy and pressure on a >>>>>> given node. >>>>> >>>>> Could you be more specific how exactly do you plan to use those >>>>> counters? >>>> >>>> Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once >>>> we identify the affected node(s), the new mpol counters (this patch) >>>> allow us correlate the pressure to the mempolicy driving it. >>> >>> I would appreciate somehow more specificity. You are adding counters >>> that are not really easy to drop once they are in. Sure we have >>> precedence of dropping some counters in the past so this is not as hard >>> as usual userspace APIs but still... >>> >>> How exactly do you tolerate mempolicy allocations to specific nodes? >>> While MPOL_MBIND is quite straightforward others are less so. >> >> The design does account for this regardless of the policy. In the call >> to __mod_node_page_state(), I'm using page_pgdat(page) so the stat is >> attributed to the node where the page actually landed. > > That much is clear[*]. The consumer side of things is not really clear to > me. How do you know which policy or part of the nodemask of that policy > is the source of the memory pressure on a particular node? In other > words how much is the data actually useful except for a single node > mempolicy (i.e. MBIND). Other than the bind policy, having the interleave (and weighted) stats would allow us to see the effective distribution of the policy. Pressure could be linked to a user configured weight scheme. I would think it could also help with confirming expected distributions. You brought up the node mask so with the preferred policy, I think this is a good one for using the counters as well. Once we're at the point where we know the node(s) under pressure and then see significant preferred allocs accounted for, we could search the numa_maps that have "prefer:<node>" to find the tasks targeting the affected nodes. I mentioned this on another thread in this series but I'll include here as well and expand some more. For any given policy, the workflow would be: 1) Pressure/OOMs reported while system-wide memory is free. 
2) Check per-node pgscan/pgsteal stats (provided by patch 2) to narrow down node(s) under pressure. They become available in /sys/devices/system/node/nodeN/vmstat. 3) Check per-policy allocation counters (this patch) on that node to find what policy was driving it. Same readout at nodeN/vmstat. 4) Now use /proc/*/numa_maps to identify tasks using the policy. > > [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the > target node even when the allocation is from a remote node from the > "local" POV. It's a good point. The accounting as a result of fallback cases shouldn't detract from an investigation though. We're interested in the node(s) under pressure so the relatively few fallback allocations would land on nodes that are not under pressure and could be viewed as acceptable noise. ^ permalink raw reply [flat|nested] 27+ messages in thread
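As a sketch of step 4 above (not part of the series), tasks whose policy
targets an affected node can be found by scanning /proc/<pid>/numa_maps for
policy strings such as "bind:<node>" or "prefer:<node>"; the exact strings are
whatever numa_maps already prints, and the matching below is deliberately
crude (it does not distinguish node 1 from node 10, for example):

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *node = argc > 1 ? argv[1] : "0";
	char needles[2][32], path[64], line[512];
	struct dirent *de;
	DIR *proc = opendir("/proc");

	if (!proc)
		return 1;
	snprintf(needles[0], sizeof(needles[0]), "bind:%s", node);
	snprintf(needles[1], sizeof(needles[1]), "prefer:%s", node);

	while ((de = readdir(proc))) {
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;
		snprintf(path, sizeof(path), "/proc/%s/numa_maps", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		/* Report the first mapping that names the node of interest. */
		while (fgets(line, sizeof(line), f)) {
			if (strstr(line, needles[0]) ||
			    strstr(line, needles[1])) {
				printf("pid %s: %s", de->d_name, line);
				break;
			}
		}
		fclose(f);
	}
	closedir(proc);
	return 0;
}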
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-17 7:48 ` JP Kobryn (Meta) @ 2026-02-17 12:37 ` Michal Hocko 2026-02-17 18:19 ` JP Kobryn (Meta) 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-17 12:37 UTC (permalink / raw) To: JP Kobryn (Meta) Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Mon 16-02-26 23:48:42, JP Kobryn (Meta) wrote: > On 2/16/26 1:07 PM, Michal Hocko wrote: [...] > > [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the > > target node even when the allocation is from a remote node from the > > "local" POV. > > It's a good point. The accounting as a result of fallback cases > shouldn't detract from an investigation though. We're interested in the > node(s) under pressure so the relatively few fallback allocations would > land on nodes that are not under pressure and could be viewed as > acceptable noise. This is really confusing. You simply have no means to tell the difference between the requested node and the real node used so you cannot really say whether the memory pressure is because of fallbacks or your mempolicy configurations. That means that you cannot tell the difference between the source of the pressure and victim of that pressure. I am not saying these scheme doesn't work in your particular setup but I do not see this is long term maintainable thing. It is just too easy to get misleading numbers. If we want/need to track mempolicy allocations better than what existing numa_* counters offer then this needs to be thought through I believe. I do not think we should add these counters in this form. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-17 12:37 ` Michal Hocko @ 2026-02-17 18:19 ` JP Kobryn (Meta) 2026-02-17 18:52 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn (Meta) @ 2026-02-17 18:19 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/17/26 4:37 AM, Michal Hocko wrote: > On Mon 16-02-26 23:48:42, JP Kobryn (Meta) wrote: >> On 2/16/26 1:07 PM, Michal Hocko wrote: > [...] >>> [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the >>> target node even when the allocation is from a remote node from the >>> "local" POV. >> >> It's a good point. The accounting as a result of fallback cases >> shouldn't detract from an investigation though. We're interested in the >> node(s) under pressure so the relatively few fallback allocations would >> land on nodes that are not under pressure and could be viewed as >> acceptable noise. > > This is really confusing. You simply have no means to tell the > difference between the requested node and the real node used so you > cannot really say whether the memory pressure is because of fallbacks or > your mempolicy configurations. That means that you cannot tell the > difference between the source of the pressure and victim of that > pressure. What if I excluded the fallback cases? I could get the actual node from the allocated page and compare against the requested node or node mask. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-17 18:19 ` JP Kobryn (Meta) @ 2026-02-17 18:52 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2026-02-17 18:52 UTC (permalink / raw) To: JP Kobryn (Meta) Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Tue 17-02-26 10:19:08, JP Kobryn (Meta) wrote: > On 2/17/26 4:37 AM, Michal Hocko wrote: > > On Mon 16-02-26 23:48:42, JP Kobryn (Meta) wrote: > > > On 2/16/26 1:07 PM, Michal Hocko wrote: > > [...] > > > > [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the > > > > target node even when the allocation is from a remote node from the > > > > "local" POV. > > > > > > It's a good point. The accounting as a result of fallback cases > > > shouldn't detract from an investigation though. We're interested in the > > > node(s) under pressure so the relatively few fallback allocations would > > > land on nodes that are not under pressure and could be viewed as > > > acceptable noise. > > > > This is really confusing. You simply have no means to tell the > > difference between the requested node and the real node used so you > > cannot really say whether the memory pressure is because of fallbacks or > > your mempolicy configurations. That means that you cannot tell the > > difference between the source of the pressure and victim of that > > pressure. > > What if I excluded the fallback cases? I could get the actual node from > the allocated page and compare against the requested node or node mask. I think it would make sense to send the per-node reclaim stats separately as there doesn't seem to be any dispute about that. For mempolicy stats try to define semantic for each mempolicy first. What exactly do you miss from existing numa_*? Do you want to count number of requests/successes. Do you want to track failures? In what kind of granularity (track fallback nodes)? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn 2026-02-12 7:29 ` Michal Hocko @ 2026-02-12 15:07 ` Shakeel Butt 2026-02-12 21:23 ` JP Kobryn 2026-02-12 15:24 ` Vlastimil Babka 2026-02-18 4:25 ` kernel test robot 3 siblings, 1 reply; 27+ messages in thread From: Shakeel Butt @ 2026-02-12 15:07 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Wed, Feb 11, 2026 at 08:51:08PM -0800, JP Kobryn wrote: > It would be useful to see a breakdown of allocations to understand which > NUMA policies are driving them. For example, when investigating memory > pressure, having policy-specific counts could show that allocations were > bound to the affected node (via MPOL_BIND). > > Add per-policy page allocation counters as new node stat items. These > counters can provide correlation between a mempolicy and pressure on a > given node. > > Signed-off-by: JP Kobryn <inwardvessel@gmail.com> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [...] > int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) > { > struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; > @@ -2446,8 +2461,14 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > > nodemask = policy_nodemask(gfp, pol, ilx, &nid); > > - if (pol->mode == MPOL_PREFERRED_MANY) > - return alloc_pages_preferred_many(gfp, order, nid, nodemask); > + if (pol->mode == MPOL_PREFERRED_MANY) { > + page = alloc_pages_preferred_many(gfp, order, nid, nodemask); > + if (page) > + __mod_node_page_state(page_pgdat(page), > + mpol_node_stat(MPOL_PREFERRED_MANY), 1 << order); Here and two places below, please use mod_node_page_state() instead of __mod_node_page_state() as __foo() requires preempt disable or if the given stat can be updated in IRQ, then IRQ disable. This code path does not do either of that. > + > + return page; > + } > > if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && > /* filter "hugepage" allocation, unless from alloc_pages() */ > @@ -2472,6 +2493,9 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > page = __alloc_frozen_pages_noprof( > gfp | __GFP_THISNODE | __GFP_NORETRY, order, > nid, NULL); > + if (page) > + __mod_node_page_state(page_pgdat(page), > + mpol_node_stat(pol->mode), 1 << order); > if (page || !(gfp & __GFP_DIRECT_RECLAIM)) > return page; > /* > @@ -2484,6 +2508,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > } > > page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask); > + if (page) > + __mod_node_page_state(page_pgdat(page), mpol_node_stat(pol->mode), 1 << order); > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 15:07 ` Shakeel Butt @ 2026-02-12 21:23 ` JP Kobryn 0 siblings, 0 replies; 27+ messages in thread From: JP Kobryn @ 2026-02-12 21:23 UTC (permalink / raw) To: Shakeel Butt Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/12/26 7:07 AM, Shakeel Butt wrote: > On Wed, Feb 11, 2026 at 08:51:08PM -0800, JP Kobryn wrote: >> It would be useful to see a breakdown of allocations to understand which >> NUMA policies are driving them. For example, when investigating memory >> pressure, having policy-specific counts could show that allocations were >> bound to the affected node (via MPOL_BIND). >> >> Add per-policy page allocation counters as new node stat items. These >> counters can provide correlation between a mempolicy and pressure on a >> given node. >> >> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> >> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > [...] > >> int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) >> { >> struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; >> @@ -2446,8 +2461,14 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, >> >> nodemask = policy_nodemask(gfp, pol, ilx, &nid); >> >> - if (pol->mode == MPOL_PREFERRED_MANY) >> - return alloc_pages_preferred_many(gfp, order, nid, nodemask); >> + if (pol->mode == MPOL_PREFERRED_MANY) { >> + page = alloc_pages_preferred_many(gfp, order, nid, nodemask); >> + if (page) >> + __mod_node_page_state(page_pgdat(page), >> + mpol_node_stat(MPOL_PREFERRED_MANY), 1 << order); > > Here and two places below, please use mod_node_page_state() instead of > __mod_node_page_state() as __foo() requires preempt disable or if the > given stat can be updated in IRQ, then IRQ disable. This code path does > not do either of that. Thanks, I also see syzbot flagged this as well. I can make this change in v2. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn 2026-02-12 7:29 ` Michal Hocko 2026-02-12 15:07 ` Shakeel Butt @ 2026-02-12 15:24 ` Vlastimil Babka 2026-02-12 21:25 ` JP Kobryn 2026-02-18 4:25 ` kernel test robot 3 siblings, 1 reply; 27+ messages in thread From: Vlastimil Babka @ 2026-02-12 15:24 UTC (permalink / raw) To: JP Kobryn, linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/12/26 05:51, JP Kobryn wrote: > It would be useful to see a breakdown of allocations to understand which > NUMA policies are driving them. For example, when investigating memory > pressure, having policy-specific counts could show that allocations were > bound to the affected node (via MPOL_BIND). > > Add per-policy page allocation counters as new node stat items. These > counters can provide correlation between a mempolicy and pressure on a > given node. > > Signed-off-by: JP Kobryn <inwardvessel@gmail.com> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended in a way that would capture any missing important details? A counter per policy type seems exhaustive, but then on one hand it might be not important to distinguish beetween some of them, and on the other hand it doesn't track the nodemask anyway. > --- > include/linux/mmzone.h | 9 +++++++++ > mm/mempolicy.c | 30 ++++++++++++++++++++++++++++-- > mm/vmstat.c | 9 +++++++++ > 3 files changed, 46 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index fc5d6c88d2f0..762609d5f0af 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -255,6 +255,15 @@ enum node_stat_item { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > +#ifdef CONFIG_NUMA > + PGALLOC_MPOL_DEFAULT, > + PGALLOC_MPOL_PREFERRED, > + PGALLOC_MPOL_BIND, > + PGALLOC_MPOL_INTERLEAVE, > + PGALLOC_MPOL_LOCAL, > + PGALLOC_MPOL_PREFERRED_MANY, > + PGALLOC_MPOL_WEIGHTED_INTERLEAVE, > +#endif > #ifdef CONFIG_HUGETLB_PAGE > NR_HUGETLB, > #endif > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 68a98ba57882..3c64784af761 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -217,6 +217,21 @@ static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw) > new_iw[nid] /= iw_gcd; > } > > +#define CHECK_MPOL_NODE_STAT_OFFSET(mpol) \ > + BUILD_BUG_ON(PGALLOC_##mpol - mpol != PGALLOC_MPOL_DEFAULT) > + > +static enum node_stat_item mpol_node_stat(unsigned short mode) > +{ > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_PREFERRED); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_BIND); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_INTERLEAVE); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_LOCAL); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_PREFERRED_MANY); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_WEIGHTED_INTERLEAVE); > + > + return PGALLOC_MPOL_DEFAULT + mode; > +} > + > int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) > { > struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; > @@ -2446,8 +2461,14 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > > nodemask = policy_nodemask(gfp, pol, ilx, &nid); > > - if (pol->mode 
== MPOL_PREFERRED_MANY) > - return alloc_pages_preferred_many(gfp, order, nid, nodemask); > + if (pol->mode == MPOL_PREFERRED_MANY) { > + page = alloc_pages_preferred_many(gfp, order, nid, nodemask); > + if (page) > + __mod_node_page_state(page_pgdat(page), > + mpol_node_stat(MPOL_PREFERRED_MANY), 1 << order); > + > + return page; > + } > > if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && > /* filter "hugepage" allocation, unless from alloc_pages() */ > @@ -2472,6 +2493,9 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > page = __alloc_frozen_pages_noprof( > gfp | __GFP_THISNODE | __GFP_NORETRY, order, > nid, NULL); > + if (page) > + __mod_node_page_state(page_pgdat(page), > + mpol_node_stat(pol->mode), 1 << order); > if (page || !(gfp & __GFP_DIRECT_RECLAIM)) > return page; > /* > @@ -2484,6 +2508,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > } > > page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask); > + if (page) > + __mod_node_page_state(page_pgdat(page), mpol_node_stat(pol->mode), 1 << order); > > if (unlikely(pol->mode == MPOL_INTERLEAVE || > pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) { > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 65de88cdf40e..74e0ddde1e93 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1291,6 +1291,15 @@ const char * const vmstat_text[] = { > [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", > [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", > [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", > +#ifdef CONFIG_NUMA > + [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", > + [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", > + [I(PGALLOC_MPOL_BIND)] = "pgalloc_mpol_bind", > + [I(PGALLOC_MPOL_INTERLEAVE)] = "pgalloc_mpol_interleave", > + [I(PGALLOC_MPOL_LOCAL)] = "pgalloc_mpol_local", > + [I(PGALLOC_MPOL_PREFERRED_MANY)] = "pgalloc_mpol_preferred_many", > + [I(PGALLOC_MPOL_WEIGHTED_INTERLEAVE)] = "pgalloc_mpol_weighted_interleave", > +#endif > #ifdef CONFIG_HUGETLB_PAGE > [I(NR_HUGETLB)] = "nr_hugetlb", > #endif ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 15:24 ` Vlastimil Babka @ 2026-02-12 21:25 ` JP Kobryn 2026-02-13 8:54 ` Vlastimil Babka 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn @ 2026-02-12 21:25 UTC (permalink / raw) To: Vlastimil Babka, linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/12/26 7:24 AM, Vlastimil Babka wrote: > On 2/12/26 05:51, JP Kobryn wrote: >> It would be useful to see a breakdown of allocations to understand which >> NUMA policies are driving them. For example, when investigating memory >> pressure, having policy-specific counts could show that allocations were >> bound to the affected node (via MPOL_BIND). >> >> Add per-policy page allocation counters as new node stat items. These >> counters can provide correlation between a mempolicy and pressure on a >> given node. >> >> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> >> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended > in a way that would capture any missing important details? A counter per > policy type seems exhaustive, but then on one hand it might be not important > to distinguish beetween some of them, and on the other hand it doesn't track > the nodemask anyway. The two patches of the series should complement each other. When investigating memory pressure, we could identify the affected nodes (patch 2). Then we can cross-reference the policy-specific stats to find any correlation (this patch). I think extending numa_* counters would call for more permutations to account for the numa stat per policy. I think distinguishing between MPOL_DEFAULT and MPOL_BIND is meaningful, for example. Am I understanding your question? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 21:25 ` JP Kobryn @ 2026-02-13 8:54 ` Vlastimil Babka 2026-02-13 19:56 ` JP Kobryn (Meta) 0 siblings, 1 reply; 27+ messages in thread From: Vlastimil Babka @ 2026-02-13 8:54 UTC (permalink / raw) To: JP Kobryn, linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/12/26 22:25, JP Kobryn wrote: > On 2/12/26 7:24 AM, Vlastimil Babka wrote: >> On 2/12/26 05:51, JP Kobryn wrote: >>> It would be useful to see a breakdown of allocations to understand which >>> NUMA policies are driving them. For example, when investigating memory >>> pressure, having policy-specific counts could show that allocations were >>> bound to the affected node (via MPOL_BIND). >>> >>> Add per-policy page allocation counters as new node stat items. These >>> counters can provide correlation between a mempolicy and pressure on a >>> given node. >>> >>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> >>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> >> >> Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended >> in a way that would capture any missing important details? A counter per >> policy type seems exhaustive, but then on one hand it might be not important >> to distinguish beetween some of them, and on the other hand it doesn't track >> the nodemask anyway. > > The two patches of the series should complement each other. When > investigating memory pressure, we could identify the affected nodes > (patch 2). Then we can cross-reference the policy-specific stats to find > any correlation (this patch). > > I think extending numa_* counters would call for more permutations to > account for the numa stat per policy. I think distinguishing between > MPOL_DEFAULT and MPOL_BIND is meaningful, for example. Am I Are there other useful examples or would it be enough to add e.g. a numa_bind counter to the numa_hit/miss/etc? What I'm trying to say the level of detail you are trying to add to the always-on counters seems like more suitable for tracepoints. The counters should be limited to what's known to be useful and not "everything we are able to track and possibly could need one day". > understanding your question? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-13 8:54 ` Vlastimil Babka @ 2026-02-13 19:56 ` JP Kobryn (Meta) 0 siblings, 0 replies; 27+ messages in thread From: JP Kobryn (Meta) @ 2026-02-13 19:56 UTC (permalink / raw) To: Vlastimil Babka, linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/13/26 12:54 AM, Vlastimil Babka wrote: > On 2/12/26 22:25, JP Kobryn wrote: >> On 2/12/26 7:24 AM, Vlastimil Babka wrote: >>> On 2/12/26 05:51, JP Kobryn wrote: >>>> It would be useful to see a breakdown of allocations to understand which >>>> NUMA policies are driving them. For example, when investigating memory >>>> pressure, having policy-specific counts could show that allocations were >>>> bound to the affected node (via MPOL_BIND). >>>> >>>> Add per-policy page allocation counters as new node stat items. These >>>> counters can provide correlation between a mempolicy and pressure on a >>>> given node. >>>> >>>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> >>>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> >>> >>> Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended >>> in a way that would capture any missing important details? A counter per >>> policy type seems exhaustive, but then on one hand it might be not important >>> to distinguish beetween some of them, and on the other hand it doesn't track >>> the nodemask anyway. >> >> The two patches of the series should complement each other. When >> investigating memory pressure, we could identify the affected nodes >> (patch 2). Then we can cross-reference the policy-specific stats to find >> any correlation (this patch). >> >> I think extending numa_* counters would call for more permutations to >> account for the numa stat per policy. I think distinguishing between >> MPOL_DEFAULT and MPOL_BIND is meaningful, for example. Am I > > Are there other useful examples or would it be enough to add e.g. a > numa_bind counter to the numa_hit/miss/etc? Aside from bind, it's worth emphasizing that with default policy tracking we could see if the local node is the source of pressure. In the interleave case, we would be able to see if the loads are being balanced or, in the weighted case, being distributed properly. On extending the numa stats instead, I looked into this some more. I'm not sure if they're a good fit. They seem more about whether the allocator succeeded at placement rather than which policy drove the allocation. Thoughts? > What I'm trying to say the level of detail you are trying to add to the > always-on counters seems like more suitable for tracepoints. The counters > should be limited to what's known to be useful and not "everything we are > able to track and possibly could need one day". In a triage scenario, having the stats collected up to the time of the reported issue would be better. We make use of the tool called below[0]. It periodically samples the system and allows us to view the historical state prior to the issue. If we started at the time of the incident and attached tracepoints it would be too late. The triage workflow would look like this: 1) Pressure/OOMs reported while system-wide memory is free. 
2) Check per-node pgscan/pgsteal stats (provided by patch 2) to narrow down node(s) under pressure. 3) Check per-policy allocation counters (this patch) on that node to find what policy was driving it. [0] https://github.com/facebookincubator/below ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn ` (2 preceding siblings ...) 2026-02-12 15:24 ` Vlastimil Babka @ 2026-02-18 4:25 ` kernel test robot 3 siblings, 0 replies; 27+ messages in thread From: kernel test robot @ 2026-02-18 4:25 UTC (permalink / raw) To: JP Kobryn Cc: oe-lkp, lkp, Johannes Weiner, linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team, oliver.sang Hello, kernel test robot noticed "RIP:__mod_node_page_state" on: commit: 4b5f69459c0988d3b292aceb74633e04eea84c7f ("[PATCH 1/2] mm/mempolicy: track page allocations per mempolicy") url: https://github.com/intel-lab-lkp/linux/commits/JP-Kobryn/mm-mempolicy-track-page-allocations-per-mempolicy/20260212-142941 base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/all/20260212045109.255391-2-inwardvessel@gmail.com/ patch subject: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy in testcase: boot config: x86_64-randconfig-007-20250327 compiler: gcc-14 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G (please refer to attached dmesg/kmsg for entire log/backtrace) +------------------------------------------------------------------+------------+------------+ | | 5cbf93e36f | 4b5f69459c | +------------------------------------------------------------------+------------+------------+ | boot_successes | 244 | 0 | | boot_failures | 0 | 244 | | RIP:__mod_node_page_state | 0 | 244 | | BUG:using__this_cpu_read()in_preemptible | 0 | 244 | | BUG:using__this_cpu_write()in_preemptible[#]code:kthreadd | 0 | 244 | | BUG:using__this_cpu_write()in_preemptible[#]code:swapper | 0 | 187 | | BUG:using__this_cpu_write()in_preemptible[#]code:kdevtmpfs | 0 | 79 | | BUG:using__this_cpu_write()in_preemptible[#]code:kworker/u8 | 0 | 229 | | BUG:using__this_cpu_write()in_preemptible[#]code:udevd | 0 | 62 | | BUG:using__this_cpu_write()in_preemptible[#]code:tail | 0 | 21 | | BUG:using__this_cpu_write()in_preemptible[#]code:syslogd | 0 | 54 | | BUG:using__this_cpu_write()in_preemptible[#]code:klogd | 0 | 113 | | BUG:using__this_cpu_write()in_preemptible[#]code:sleep | 0 | 98 | | BUG:using__this_cpu_write()in_preemptible[#]code:post-run | 0 | 39 | | BUG:using__this_cpu_write()in_preemptible[#]code:rsync | 0 | 9 | | BUG:using__this_cpu_write()in_preemptible[#]code:modprobe | 0 | 6 | | BUG:using__this_cpu_write()in_preemptible[#]code | 0 | 32 | | BUG:using__this_cpu_write()in_preemptible[#]code:udevadm | 0 | 78 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd | 0 | 39 | | BUG:using__this_cpu_write()in_preemptible[#]code:(udev-worker) | 0 | 53 | | RIP:rep_movs_alternative | 0 | 5 | | BUG:using__this_cpu_write()in_preemptible[#]code:cat | 0 | 7 | | BUG:using__this_cpu_write()in_preemptible[#]code:sed | 0 | 98 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-udevd | 0 | 19 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-journal | 0 | 54 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-random | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:journalctl | 0 | 8 | | 
BUG:using__this_cpu_write()in_preemptible[#]code:start_getty | 0 | 4 | | RIP:__put_user_4 | 0 | 24 | | BUG:using__this_cpu_write()in_preemptible[#]code:wget | 0 | 82 | | BUG:using__this_cpu_write()in_preemptible[#]code:run-lkp | 0 | 32 | | BUG:using__this_cpu_write()in_preemptible[#]code:boot-#-yocto-i3 | 0 | 24 | | BUG:using__this_cpu_write()in_preemptible[#]code:one-shot-monito | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:vmstat | 0 | 29 | | BUG:using__this_cpu_write()in_preemptible[#]code:rs:main_Q:Reg | 0 | 9 | | RIP:rep_stos_alternative | 0 | 11 | | BUG:using__this_cpu_write()in_preemptible[#]code:lkp-setup-rootf | 0 | 21 | | BUG:using__this_cpu_write()in_preemptible[#]code:stty | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:tee | 0 | 7 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-rc-loca | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:(exec-inner) | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:groupadd | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:(sd-exec-strv) | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:rc | 0 | 14 | | BUG:using__this_cpu_write()in_preemptible[#]code:getty | 0 | 18 | | BUG:using__this_cpu_write()in_preemptible[#]code:boot-#-debian | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:monitor | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-tmpfile | 0 | 6 | | BUG:using__this_cpu_write()in_preemptible[#]code:lscpu | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:dirname | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-sysuser | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:(d-sysctl) | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:mount | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:ls | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:pgrep | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:grep | 0 | 8 | | BUG:using__this_cpu_write()in_preemptible[#]code:S77lkp-bootstra | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:date | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-sysctl | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:find | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:sshd | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-system | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-sysv-ge | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-hiberna | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:journal-offline | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:sysctl | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:init | 0 | 7 | | BUG:using__this_cpu_write()in_preemptible[#]code:mkdir | 0 | 6 | | BUG:using__this_cpu_write()in_preemptible[#]code:mountpoint | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-logind | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:dmesg | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-ssh-gen | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:cp | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:wakeup | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:dpkg-deb | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:dpkg | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:(modprobe) | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:sync | 0 | 1 | | 
BUG:using__this_cpu_write()in_preemptible[#]code:systemd-update | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:kmod | 0 | 1 | | RIP:strncpy_from_user | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:sm-notify | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-remount | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:blkmapd | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:mkfifo | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:ln | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:sh | 0 | 5 | | BUG:using__this_cpu_write()in_preemptible[#]code:bootlogd | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:run-test | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:S07bootlogd | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:hwclock.sh | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:(sd-mkdcreds) | 0 | 1 | | RIP:filldir64 | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:chmod | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:ps | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:which | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:ip | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:start-stop-daem | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:S20syslog | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-gpt-aut | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-debug-g | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:(rpcbind) | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:seq | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-run-gen | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:wait | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:addgroup | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:rm | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:in:imklog | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:basename | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:touch | 0 | 1 | | RIP:ia32_setup_frame | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:no-stdout-monit | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-tpm#-ge | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:(mount) | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:ldconfig | 0 | 1 | +------------------------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202602181136.f66ba888-lkp@intel.com

[ 0.624787][ T2] ------------[ cut here ]------------
[ 0.625191][ T2] WARNING: mm/vmstat.c:396 at __mod_node_page_state+0x88/0x1c0, CPU#0: kthreadd/2
[ 0.625887][ T2] Modules linked in:
[ 0.626070][ T2] CPU: 0 UID: 0 PID: 2 Comm: kthreadd Tainted: G T 6.19.0-rc6-00596-g4b5f69459c09 #1 PREEMPT(lazy) a55f7fce8adbfb8e52612c1f0ea71f4db1a1df23
[ 0.626084][ T2] Tainted: [T]=RANDSTRUCT
[ 0.626402][ T2] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 0.627150][ T2] RIP: 0010:__mod_node_page_state (mm/vmstat.c:396 (discriminator 34))
[ 0.627592][ T2] Code: 8b 05 88 b9 73 02 48 c7 c7 d8 b0 b4 83 85 c0 89 45 d0 40 0f 95 c6 31 c9 31 d2 40 0f b6 f6 e8 3f 96 e4 ff 8b 45 d0 85 c0 74 1b <0f> 0b be 01 00 00 00 eb 14 31 c9 31 d2 31 f6 48 c7 c7 d8 b0 b4 83
All code
========
   0: 8b 05 88 b9 73 02     mov 0x273b988(%rip),%eax # 0x273b98e
   6: 48 c7 c7 d8 b0 b4 83  mov $0xffffffff83b4b0d8,%rdi
   d: 85 c0                 test %eax,%eax
   f: 89 45 d0              mov %eax,-0x30(%rbp)
  12: 40 0f 95 c6           setne %sil
  16: 31 c9                 xor %ecx,%ecx
  18: 31 d2                 xor %edx,%edx
  1a: 40 0f b6 f6           movzbl %sil,%esi
  1e: e8 3f 96 e4 ff        call 0xffffffffffe49662
  23: 8b 45 d0              mov -0x30(%rbp),%eax
  26: 85 c0                 test %eax,%eax
  28: 74 1b                 je 0x45
  2a:* 0f 0b                ud2 <-- trapping instruction
  2c: be 01 00 00 00        mov $0x1,%esi
  31: eb 14                 jmp 0x47
  33: 31 c9                 xor %ecx,%ecx
  35: 31 d2                 xor %edx,%edx
  37: 31 f6                 xor %esi,%esi
  39: 48 c7 c7 d8 b0 b4 83  mov $0xffffffff83b4b0d8,%rdi

Code starting with the faulting instruction
===========================================
   0: 0f 0b                 ud2
   2: be 01 00 00 00        mov $0x1,%esi
   7: eb 14                 jmp 0x1d
   9: 31 c9                 xor %ecx,%ecx
   b: 31 d2                 xor %edx,%edx
   d: 31 f6                 xor %esi,%esi
   f: 48 c7 c7 d8 b0 b4 83  mov $0xffffffff83b4b0d8,%rdi
[ 0.629418][ T2] RSP: 0000:ffff88810039fa20 EFLAGS: 00010202
[ 0.629869][ T2] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000000
[ 0.630445][ T2] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 0.631089][ T2] RBP: ffff88810039fa50 R08: 0000000000000000 R09: 0000000000000000
[ 0.631671][ T2] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88883ffe02c0
[ 0.632247][ T2] R13: ffffffff83f18971 R14: ffffffff83f18940 R15: 0000000000000030
[ 0.632746][ T2] FS: 0000000000000000(0000) GS:ffff88889bd1c000(0000) knlGS:0000000000000000
[ 0.633394][ T2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.633875][ T2] CR2: ffff88883ffff000 CR3: 000000000343d000 CR4: 00000000000406b0
[ 0.634478][ T2] Call Trace:
[ 0.634723][ T2]  <TASK>
[ 0.634951][ T2]  alloc_pages_mpol (mm/mempolicy.c:2513 (discriminator 1))
[ 0.635326][ T2]  alloc_frozen_pages_noprof (mm/mempolicy.c:2584)
[ 0.635746][ T2]  allocate_slab (mm/slub.c:3075 (discriminator 2) mm/slub.c:3248 (discriminator 2))
[ 0.636086][ T2]  new_slab (mm/slub.c:3304)
[ 0.636394][ T2]  ___slab_alloc (mm/slub.c:4657)
[ 0.636749][ T2]  ? dup_task_struct (kernel/fork.c:184 (discriminator 2) kernel/fork.c:915 (discriminator 2))
[ 0.637114][ T2]  __slab_alloc+0x8a/0x180
[ 0.637519][ T2]  slab_alloc_node+0x189/0x340
[ 0.637919][ T2]  ? dup_task_struct (kernel/fork.c:184 (discriminator 2) kernel/fork.c:915 (discriminator 2))
[ 0.638285][ T2]  kmem_cache_alloc_node_noprof (mm/slub.c:5317 (discriminator 1))
[ 0.638710][ T2]  dup_task_struct (kernel/fork.c:184 (discriminator 2) kernel/fork.c:915 (discriminator 2))
[ 0.639058][ T2]  ? ftrace_likely_update (arch/x86/include/asm/smap.h:90 kernel/trace/trace_branch.c:223)
[ 0.639416][ T2]  copy_process (kernel/fork.c:2052 (discriminator 1))
[ 0.639773][ T2]  kernel_clone (include/linux/random.h:26 kernel/fork.c:2652)
[ 0.640115][ T2]  ? kthread_fetch_affinity (kernel/kthread.c:412)
[ 0.640552][ T2]  kernel_thread (kernel/fork.c:2713)
[ 0.640892][ T2]  ? kthread_fetch_affinity (kernel/kthread.c:412)
[ 0.641310][ T2]  kthreadd (kernel/kthread.c:486 kernel/kthread.c:844)
[ 0.641621][ T2]  ? kthreadd (kernel/kthread.c:830 (discriminator 5))
[ 0.641938][ T2]  ? kthread_is_per_cpu (kernel/kthread.c:816)
[ 0.642316][ T2]  ret_from_fork (arch/x86/kernel/process.c:164)
[ 0.642657][ T2]  ? kthread_is_per_cpu (kernel/kthread.c:816)
[ 0.642744][ T2]  ? kthread_is_per_cpu (kernel/kthread.c:816)
[ 0.643127][ T2]  ret_from_fork_asm (arch/x86/entry/entry_64.S:256)
[ 0.643502][ T2]  </TASK>
[ 0.643755][ T2] irq event stamp: 393
[ 0.644054][ T2] hardirqs last enabled at (401): __up_console_sem (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/printk/printk.c:345)
[ 0.644730][ T2] hardirqs last disabled at (408): __up_console_sem (kernel/printk/printk.c:343 (discriminator 3))
[ 0.645406][ T2] softirqs last enabled at (54): handle_softirqs (kernel/softirq.c:469 (discriminator 1) kernel/softirq.c:650 (discriminator 1))
[ 0.646077][ T2] softirqs last disabled at (49): __irq_exit_rcu (kernel/softirq.c:657 kernel/softirq.c:496 kernel/softirq.c:723)
[ 0.646741][ T2] ---[ end trace 0000000000000000 ]---

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260218/202602181136.f66ba888-lkp@intel.com

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 27+ messages in thread
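The warning above fires because __mod_node_page_state() is reached from a
preemptible context via alloc_pages_mpol(); as the check at mm/vmstat.c:396
indicates, the double-underscore vmstat helpers expect preemption or
interrupts to already be disabled by the caller. As a minimal sketch only --
this is not a fix proposed in this thread, and the helper name below is made
up for illustration -- the accounting could instead go through the
preemption-safe mod_node_page_state() variant:

	/*
	 * Illustrative sketch: account a successful mempolicy allocation
	 * with the preemption-safe vmstat helper, which is valid from
	 * contexts that may be preempted.
	 */
	static void mpol_stat_add(struct page *page, enum node_stat_item item,
				  unsigned int order)
	{
		if (page)
			mod_node_page_state(page_pgdat(page), item, 1 << order);
	}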
* [PATCH 2/2] mm: move pgscan and pgsteal to node stats 2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn @ 2026-02-12 4:51 ` JP Kobryn 2026-02-12 7:08 ` Michael S. Tsirkin 2026-02-12 7:29 ` Michal Hocko 2026-02-12 4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox 2026-02-12 18:08 ` [syzbot ci] " syzbot ci 3 siblings, 2 replies; 27+ messages in thread From: JP Kobryn @ 2026-02-12 4:51 UTC (permalink / raw) To: linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team It would be useful to narrow down reclaim to specific nodes. Provide per-node reclaim visibility by changing the pgscan and pgsteal stats from global vm_event_item's to node_stat_item's. Note this change has the side effect of now tracking these stats on a per-memcg basis. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> --- drivers/virtio/virtio_balloon.c | 8 ++++---- include/linux/mmzone.h | 12 +++++++++++ include/linux/vm_event_item.h | 12 ----------- mm/memcontrol.c | 36 ++++++++++++++++++--------------- mm/vmscan.c | 32 +++++++++++------------------ mm/vmstat.c | 24 +++++++++++----------- 6 files changed, 60 insertions(+), 64 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 74fe59f5a78c..1341d9d1a2a1 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -374,13 +374,13 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb) update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN, - pages_to_bytes(events[PGSCAN_KSWAPD])); + pages_to_bytes(global_node_page_state(PGSCAN_KSWAPD))); update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN, - pages_to_bytes(events[PGSCAN_DIRECT])); + pages_to_bytes(global_node_page_state(PGSCAN_DIRECT))); update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM, - pages_to_bytes(events[PGSTEAL_KSWAPD])); + pages_to_bytes(global_node_page_state(PGSTEAL_KSWAPD))); update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM, - pages_to_bytes(events[PGSTEAL_DIRECT])); + pages_to_bytes(global_node_page_state(PGSTEAL_DIRECT))); #ifdef CONFIG_HUGETLB_PAGE update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 762609d5f0af..fc39c107a4b5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -255,6 +255,18 @@ enum node_stat_item { PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, PGDEMOTE_PROACTIVE, + PGSTEAL_KSWAPD, + PGSTEAL_DIRECT, + PGSTEAL_KHUGEPAGED, + PGSTEAL_PROACTIVE, + PGSTEAL_ANON, + PGSTEAL_FILE, + PGSCAN_KSWAPD, + PGSCAN_DIRECT, + PGSCAN_KHUGEPAGED, + PGSCAN_PROACTIVE, + PGSCAN_ANON, + PGSCAN_FILE, #ifdef CONFIG_NUMA PGALLOC_MPOL_DEFAULT, PGALLOC_MPOL_PREFERRED, diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 92f80b4d69a6..6f1787680658 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -40,19 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGLAZYFREED, PGREFILL, PGREUSE, - PGSTEAL_KSWAPD, - 
PGSTEAL_DIRECT, - PGSTEAL_KHUGEPAGED, - PGSTEAL_PROACTIVE, - PGSCAN_KSWAPD, - PGSCAN_DIRECT, - PGSCAN_KHUGEPAGED, - PGSCAN_PROACTIVE, PGSCAN_DIRECT_THROTTLE, - PGSCAN_ANON, - PGSCAN_FILE, - PGSTEAL_ANON, - PGSTEAL_FILE, #ifdef CONFIG_NUMA PGSCAN_ZONE_RECLAIM_SUCCESS, PGSCAN_ZONE_RECLAIM_FAILED, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 86f43b7e5f71..bde0b6536be6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -328,6 +328,18 @@ static const unsigned int memcg_node_stat_items[] = { PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, PGDEMOTE_PROACTIVE, + PGSTEAL_KSWAPD, + PGSTEAL_DIRECT, + PGSTEAL_KHUGEPAGED, + PGSTEAL_PROACTIVE, + PGSTEAL_ANON, + PGSTEAL_FILE, + PGSCAN_KSWAPD, + PGSCAN_DIRECT, + PGSCAN_KHUGEPAGED, + PGSCAN_PROACTIVE, + PGSCAN_ANON, + PGSCAN_FILE, #ifdef CONFIG_HUGETLB_PAGE NR_HUGETLB, #endif @@ -441,14 +453,6 @@ static const unsigned int memcg_vm_event_stat[] = { #endif PSWPIN, PSWPOUT, - PGSCAN_KSWAPD, - PGSCAN_DIRECT, - PGSCAN_KHUGEPAGED, - PGSCAN_PROACTIVE, - PGSTEAL_KSWAPD, - PGSTEAL_DIRECT, - PGSTEAL_KHUGEPAGED, - PGSTEAL_PROACTIVE, PGFAULT, PGMAJFAULT, PGREFILL, @@ -1496,15 +1500,15 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) /* Accumulated memory events */ seq_buf_printf(s, "pgscan %lu\n", - memcg_events(memcg, PGSCAN_KSWAPD) + - memcg_events(memcg, PGSCAN_DIRECT) + - memcg_events(memcg, PGSCAN_PROACTIVE) + - memcg_events(memcg, PGSCAN_KHUGEPAGED)); + memcg_page_state(memcg, PGSCAN_KSWAPD) + + memcg_page_state(memcg, PGSCAN_DIRECT) + + memcg_page_state(memcg, PGSCAN_PROACTIVE) + + memcg_page_state(memcg, PGSCAN_KHUGEPAGED)); seq_buf_printf(s, "pgsteal %lu\n", - memcg_events(memcg, PGSTEAL_KSWAPD) + - memcg_events(memcg, PGSTEAL_DIRECT) + - memcg_events(memcg, PGSTEAL_PROACTIVE) + - memcg_events(memcg, PGSTEAL_KHUGEPAGED)); + memcg_page_state(memcg, PGSTEAL_KSWAPD) + + memcg_page_state(memcg, PGSTEAL_DIRECT) + + memcg_page_state(memcg, PGSTEAL_PROACTIVE) + + memcg_page_state(memcg, PGSTEAL_KHUGEPAGED)); for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { #ifdef CONFIG_MEMCG_V1 diff --git a/mm/vmscan.c b/mm/vmscan.c index 614ccf39fe3f..16a0f21e3ea1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1977,7 +1977,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, unsigned long nr_taken; struct reclaim_stat stat; bool file = is_file_lru(lru); - enum vm_event_item item; + enum node_stat_item item; struct pglist_data *pgdat = lruvec_pgdat(lruvec); bool stalled = false; @@ -2003,10 +2003,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); item = PGSCAN_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) - __count_vm_events(item, nr_scanned); - count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); - __count_vm_events(PGSCAN_ANON + file, nr_scanned); + mod_lruvec_state(lruvec, item, nr_scanned); + mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned); spin_unlock_irq(&lruvec->lru_lock); @@ -2023,10 +2021,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, stat.nr_demoted); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); item = PGSTEAL_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) - __count_vm_events(item, nr_reclaimed); - count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); - __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); + mod_lruvec_state(lruvec, item, nr_reclaimed); + mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); 
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed); @@ -4536,7 +4532,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, { int i; int gen; - enum vm_event_item item; + enum node_stat_item item; int sorted = 0; int scanned = 0; int isolated = 0; @@ -4595,13 +4591,11 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, } item = PGSCAN_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) { - __count_vm_events(item, isolated); + if (!cgroup_reclaim(sc)) __count_vm_events(PGREFILL, sorted); - } - count_memcg_events(memcg, item, isolated); + mod_lruvec_state(lruvec, item, isolated); count_memcg_events(memcg, PGREFILL, sorted); - __count_vm_events(PGSCAN_ANON + type, isolated); + mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated); trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch, scanned, skipped, isolated, type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); @@ -4686,7 +4680,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, LIST_HEAD(clean); struct folio *folio; struct folio *next; - enum vm_event_item item; + enum node_stat_item item; struct reclaim_stat stat; struct lru_gen_mm_walk *walk; bool skip_retry = false; @@ -4750,10 +4744,8 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, stat.nr_demoted); item = PGSTEAL_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) - __count_vm_events(item, reclaimed); - count_memcg_events(memcg, item, reclaimed); - __count_vm_events(PGSTEAL_ANON + type, reclaimed); + mod_lruvec_state(lruvec, item, reclaimed); + mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); spin_unlock_irq(&lruvec->lru_lock); diff --git a/mm/vmstat.c b/mm/vmstat.c index 74e0ddde1e93..e4b259989d58 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1291,6 +1291,18 @@ const char * const vmstat_text[] = { [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", + [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", + [I(PGSTEAL_DIRECT)] = "pgsteal_direct", + [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", + [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", + [I(PGSTEAL_ANON)] = "pgsteal_anon", + [I(PGSTEAL_FILE)] = "pgsteal_file", + [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", + [I(PGSCAN_DIRECT)] = "pgscan_direct", + [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", + [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", + [I(PGSCAN_ANON)] = "pgscan_anon", + [I(PGSCAN_FILE)] = "pgscan_file", #ifdef CONFIG_NUMA [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", @@ -1344,19 +1356,7 @@ const char * const vmstat_text[] = { [I(PGREFILL)] = "pgrefill", [I(PGREUSE)] = "pgreuse", - [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", - [I(PGSTEAL_DIRECT)] = "pgsteal_direct", - [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", - [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", - [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", - [I(PGSCAN_DIRECT)] = "pgscan_direct", - [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", - [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", [I(PGSCAN_DIRECT_THROTTLE)] = "pgscan_direct_throttle", - [I(PGSCAN_ANON)] = "pgscan_anon", - [I(PGSCAN_FILE)] = "pgscan_file", - [I(PGSTEAL_ANON)] = "pgsteal_anon", - [I(PGSTEAL_FILE)] = "pgsteal_file", #ifdef CONFIG_NUMA [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success", -- 2.47.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
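The accounting in this patch keeps the existing pattern of computing the
counter as PGSCAN_KSWAPD (or PGSTEAL_KSWAPD) plus reclaimer_offset(sc), which
only holds if the kswapd/direct/khugepaged/proactive entries stay contiguous
and in that relative order in the new node_stat_item enum. A hedged sketch of
a compile-time guard documenting that layout is below; the function name is
invented for illustration and no such check is part of the posted patch:

	/*
	 * Illustrative only: document the enum-layout assumption behind
	 * the "PGSTEAL_KSWAPD + reclaimer_offset(sc)" arithmetic.
	 */
	static inline void check_reclaim_stat_layout(void)
	{
		BUILD_BUG_ON(PGSTEAL_DIRECT     != PGSTEAL_KSWAPD + 1);
		BUILD_BUG_ON(PGSTEAL_KHUGEPAGED != PGSTEAL_KSWAPD + 2);
		BUILD_BUG_ON(PGSTEAL_PROACTIVE  != PGSTEAL_KSWAPD + 3);
		BUILD_BUG_ON(PGSCAN_DIRECT      != PGSCAN_KSWAPD + 1);
		BUILD_BUG_ON(PGSCAN_KHUGEPAGED  != PGSCAN_KSWAPD + 2);
		BUILD_BUG_ON(PGSCAN_PROACTIVE   != PGSCAN_KSWAPD + 3);
	}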
* Re: [PATCH 2/2] mm: move pgscan and pgsteal to node stats 2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn @ 2026-02-12 7:08 ` Michael S. Tsirkin 2026-02-12 21:23 ` JP Kobryn 2026-02-12 7:29 ` Michal Hocko 1 sibling, 1 reply; 27+ messages in thread From: Michael S. Tsirkin @ 2026-02-12 7:08 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Wed, Feb 11, 2026 at 08:51:09PM -0800, JP Kobryn wrote: > It would be useful to narrow down reclaim to specific nodes. > > Provide per-node reclaim visibility by changing the pgscan and pgsteal > stats from global vm_event_item's to node_stat_item's. Note this change has > the side effect of now tracking these stats on a per-memcg basis. > > Signed-off-by: JP Kobryn <inwardvessel@gmail.com> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> virtio_balloon changes > --- > drivers/virtio/virtio_balloon.c | 8 ++++---- > include/linux/mmzone.h | 12 +++++++++++ > include/linux/vm_event_item.h | 12 ----------- > mm/memcontrol.c | 36 ++++++++++++++++++--------------- > mm/vmscan.c | 32 +++++++++++------------------ > mm/vmstat.c | 24 +++++++++++----------- > 6 files changed, 60 insertions(+), 64 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 74fe59f5a78c..1341d9d1a2a1 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -374,13 +374,13 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb) > update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); > > update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN, > - pages_to_bytes(events[PGSCAN_KSWAPD])); > + pages_to_bytes(global_node_page_state(PGSCAN_KSWAPD))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN, > - pages_to_bytes(events[PGSCAN_DIRECT])); > + pages_to_bytes(global_node_page_state(PGSCAN_DIRECT))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM, > - pages_to_bytes(events[PGSTEAL_KSWAPD])); > + pages_to_bytes(global_node_page_state(PGSTEAL_KSWAPD))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM, > - pages_to_bytes(events[PGSTEAL_DIRECT])); > + pages_to_bytes(global_node_page_state(PGSTEAL_DIRECT))); > > #ifdef CONFIG_HUGETLB_PAGE > update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 762609d5f0af..fc39c107a4b5 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -255,6 +255,18 @@ enum node_stat_item { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > + PGSTEAL_KSWAPD, > + PGSTEAL_DIRECT, > + PGSTEAL_KHUGEPAGED, > + PGSTEAL_PROACTIVE, > + PGSTEAL_ANON, > + PGSTEAL_FILE, > + PGSCAN_KSWAPD, > + PGSCAN_DIRECT, > + PGSCAN_KHUGEPAGED, > + PGSCAN_PROACTIVE, > + PGSCAN_ANON, > + PGSCAN_FILE, > #ifdef CONFIG_NUMA > PGALLOC_MPOL_DEFAULT, > PGALLOC_MPOL_PREFERRED, > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 92f80b4d69a6..6f1787680658 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -40,19 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > PGLAZYFREED, > PGREFILL, > PGREUSE, > - PGSTEAL_KSWAPD, > 
- PGSTEAL_DIRECT, > - PGSTEAL_KHUGEPAGED, > - PGSTEAL_PROACTIVE, > - PGSCAN_KSWAPD, > - PGSCAN_DIRECT, > - PGSCAN_KHUGEPAGED, > - PGSCAN_PROACTIVE, > PGSCAN_DIRECT_THROTTLE, > - PGSCAN_ANON, > - PGSCAN_FILE, > - PGSTEAL_ANON, > - PGSTEAL_FILE, > #ifdef CONFIG_NUMA > PGSCAN_ZONE_RECLAIM_SUCCESS, > PGSCAN_ZONE_RECLAIM_FAILED, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 86f43b7e5f71..bde0b6536be6 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -328,6 +328,18 @@ static const unsigned int memcg_node_stat_items[] = { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > + PGSTEAL_KSWAPD, > + PGSTEAL_DIRECT, > + PGSTEAL_KHUGEPAGED, > + PGSTEAL_PROACTIVE, > + PGSTEAL_ANON, > + PGSTEAL_FILE, > + PGSCAN_KSWAPD, > + PGSCAN_DIRECT, > + PGSCAN_KHUGEPAGED, > + PGSCAN_PROACTIVE, > + PGSCAN_ANON, > + PGSCAN_FILE, > #ifdef CONFIG_HUGETLB_PAGE > NR_HUGETLB, > #endif > @@ -441,14 +453,6 @@ static const unsigned int memcg_vm_event_stat[] = { > #endif > PSWPIN, > PSWPOUT, > - PGSCAN_KSWAPD, > - PGSCAN_DIRECT, > - PGSCAN_KHUGEPAGED, > - PGSCAN_PROACTIVE, > - PGSTEAL_KSWAPD, > - PGSTEAL_DIRECT, > - PGSTEAL_KHUGEPAGED, > - PGSTEAL_PROACTIVE, > PGFAULT, > PGMAJFAULT, > PGREFILL, > @@ -1496,15 +1500,15 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) > > /* Accumulated memory events */ > seq_buf_printf(s, "pgscan %lu\n", > - memcg_events(memcg, PGSCAN_KSWAPD) + > - memcg_events(memcg, PGSCAN_DIRECT) + > - memcg_events(memcg, PGSCAN_PROACTIVE) + > - memcg_events(memcg, PGSCAN_KHUGEPAGED)); > + memcg_page_state(memcg, PGSCAN_KSWAPD) + > + memcg_page_state(memcg, PGSCAN_DIRECT) + > + memcg_page_state(memcg, PGSCAN_PROACTIVE) + > + memcg_page_state(memcg, PGSCAN_KHUGEPAGED)); > seq_buf_printf(s, "pgsteal %lu\n", > - memcg_events(memcg, PGSTEAL_KSWAPD) + > - memcg_events(memcg, PGSTEAL_DIRECT) + > - memcg_events(memcg, PGSTEAL_PROACTIVE) + > - memcg_events(memcg, PGSTEAL_KHUGEPAGED)); > + memcg_page_state(memcg, PGSTEAL_KSWAPD) + > + memcg_page_state(memcg, PGSTEAL_DIRECT) + > + memcg_page_state(memcg, PGSTEAL_PROACTIVE) + > + memcg_page_state(memcg, PGSTEAL_KHUGEPAGED)); > > for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { > #ifdef CONFIG_MEMCG_V1 > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 614ccf39fe3f..16a0f21e3ea1 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1977,7 +1977,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > unsigned long nr_taken; > struct reclaim_stat stat; > bool file = is_file_lru(lru); > - enum vm_event_item item; > + enum node_stat_item item; > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > bool stalled = false; > > @@ -2003,10 +2003,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); > item = PGSCAN_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, nr_scanned); > - count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); > - __count_vm_events(PGSCAN_ANON + file, nr_scanned); > + mod_lruvec_state(lruvec, item, nr_scanned); > + mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned); > > spin_unlock_irq(&lruvec->lru_lock); > > @@ -2023,10 +2021,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > stat.nr_demoted); > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); > item = PGSTEAL_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, nr_reclaimed); > - 
count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); > - __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); > + mod_lruvec_state(lruvec, item, nr_reclaimed); > + mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); > > lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, > nr_scanned - nr_reclaimed); > @@ -4536,7 +4532,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > { > int i; > int gen; > - enum vm_event_item item; > + enum node_stat_item item; > int sorted = 0; > int scanned = 0; > int isolated = 0; > @@ -4595,13 +4591,11 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > } > > item = PGSCAN_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) { > - __count_vm_events(item, isolated); > + if (!cgroup_reclaim(sc)) > __count_vm_events(PGREFILL, sorted); > - } > - count_memcg_events(memcg, item, isolated); > + mod_lruvec_state(lruvec, item, isolated); > count_memcg_events(memcg, PGREFILL, sorted); > - __count_vm_events(PGSCAN_ANON + type, isolated); > + mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated); > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch, > scanned, skipped, isolated, > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); > @@ -4686,7 +4680,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > LIST_HEAD(clean); > struct folio *folio; > struct folio *next; > - enum vm_event_item item; > + enum node_stat_item item; > struct reclaim_stat stat; > struct lru_gen_mm_walk *walk; > bool skip_retry = false; > @@ -4750,10 +4744,8 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > stat.nr_demoted); > > item = PGSTEAL_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, reclaimed); > - count_memcg_events(memcg, item, reclaimed); > - __count_vm_events(PGSTEAL_ANON + type, reclaimed); > + mod_lruvec_state(lruvec, item, reclaimed); > + mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); > > spin_unlock_irq(&lruvec->lru_lock); > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 74e0ddde1e93..e4b259989d58 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1291,6 +1291,18 @@ const char * const vmstat_text[] = { > [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", > [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", > [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", > + [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", > + [I(PGSTEAL_DIRECT)] = "pgsteal_direct", > + [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", > + [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", > + [I(PGSTEAL_ANON)] = "pgsteal_anon", > + [I(PGSTEAL_FILE)] = "pgsteal_file", > + [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", > + [I(PGSCAN_DIRECT)] = "pgscan_direct", > + [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", > + [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", > + [I(PGSCAN_ANON)] = "pgscan_anon", > + [I(PGSCAN_FILE)] = "pgscan_file", > #ifdef CONFIG_NUMA > [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", > [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", > @@ -1344,19 +1356,7 @@ const char * const vmstat_text[] = { > > [I(PGREFILL)] = "pgrefill", > [I(PGREUSE)] = "pgreuse", > - [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", > - [I(PGSTEAL_DIRECT)] = "pgsteal_direct", > - [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", > - [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", > - [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", > - [I(PGSCAN_DIRECT)] = "pgscan_direct", > - [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", > - [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", > 
[I(PGSCAN_DIRECT_THROTTLE)] = "pgscan_direct_throttle", > - [I(PGSCAN_ANON)] = "pgscan_anon", > - [I(PGSCAN_FILE)] = "pgscan_file", > - [I(PGSTEAL_ANON)] = "pgsteal_anon", > - [I(PGSTEAL_FILE)] = "pgsteal_file", > > #ifdef CONFIG_NUMA > [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success", > -- > 2.47.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/2] mm: move pgscan and pgsteal to node stats
  2026-02-12 7:08 ` Michael S. Tsirkin
@ 2026-02-12 21:23 ` JP Kobryn
  0 siblings, 0 replies; 27+ messages in thread
From: JP Kobryn @ 2026-02-12 21:23 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mhocko, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On 2/11/26 11:08 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 11, 2026 at 08:51:09PM -0800, JP Kobryn wrote:
>> It would be useful to narrow down reclaim to specific nodes.
>>
>> Provide per-node reclaim visibility by changing the pgscan and pgsteal
>> stats from global vm_event_item's to node_stat_item's. Note this change has
>> the side effect of now tracking these stats on a per-memcg basis.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
>
> virtio_balloon changes

Thanks. I'll make sure to spell this out in the v2 changelog.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 2/2] mm: move pgscan and pgsteal to node stats 2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn 2026-02-12 7:08 ` Michael S. Tsirkin @ 2026-02-12 7:29 ` Michal Hocko 2026-02-12 21:20 ` JP Kobryn 1 sibling, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-12 7:29 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Wed 11-02-26 20:51:09, JP Kobryn wrote: > It would be useful to narrow down reclaim to specific nodes. > > Provide per-node reclaim visibility by changing the pgscan and pgsteal > stats from global vm_event_item's to node_stat_item's. Note this change has > the side effect of now tracking these stats on a per-memcg basis. The changelog could have been more clear about the actual changes as this is not overly clear for untrained eyes. The most important parts are that /proc/vmstat will preserve reclaim stats with slightly different counters ordering (shouldn't break userspace much^W), per-node stats will be now newly displayed in /proc/zoneinfo - this is presumably the primary motivation to have a better insight of per-node reclaim activity, and memcg stats will now show their share of the global memory reclaim. Have I missed anything? > Signed-off-by: JP Kobryn <inwardvessel@gmail.com> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > --- > drivers/virtio/virtio_balloon.c | 8 ++++---- > include/linux/mmzone.h | 12 +++++++++++ > include/linux/vm_event_item.h | 12 ----------- > mm/memcontrol.c | 36 ++++++++++++++++++--------------- > mm/vmscan.c | 32 +++++++++++------------------ > mm/vmstat.c | 24 +++++++++++----------- > 6 files changed, 60 insertions(+), 64 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 74fe59f5a78c..1341d9d1a2a1 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -374,13 +374,13 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb) > update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); > > update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN, > - pages_to_bytes(events[PGSCAN_KSWAPD])); > + pages_to_bytes(global_node_page_state(PGSCAN_KSWAPD))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN, > - pages_to_bytes(events[PGSCAN_DIRECT])); > + pages_to_bytes(global_node_page_state(PGSCAN_DIRECT))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM, > - pages_to_bytes(events[PGSTEAL_KSWAPD])); > + pages_to_bytes(global_node_page_state(PGSTEAL_KSWAPD))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM, > - pages_to_bytes(events[PGSTEAL_DIRECT])); > + pages_to_bytes(global_node_page_state(PGSTEAL_DIRECT))); > > #ifdef CONFIG_HUGETLB_PAGE > update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 762609d5f0af..fc39c107a4b5 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -255,6 +255,18 @@ enum node_stat_item { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > + PGSTEAL_KSWAPD, > + PGSTEAL_DIRECT, > + PGSTEAL_KHUGEPAGED, > + PGSTEAL_PROACTIVE, > + PGSTEAL_ANON, > + PGSTEAL_FILE, > + PGSCAN_KSWAPD, > + PGSCAN_DIRECT, > + 
PGSCAN_KHUGEPAGED, > + PGSCAN_PROACTIVE, > + PGSCAN_ANON, > + PGSCAN_FILE, > #ifdef CONFIG_NUMA > PGALLOC_MPOL_DEFAULT, > PGALLOC_MPOL_PREFERRED, > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 92f80b4d69a6..6f1787680658 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -40,19 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > PGLAZYFREED, > PGREFILL, > PGREUSE, > - PGSTEAL_KSWAPD, > - PGSTEAL_DIRECT, > - PGSTEAL_KHUGEPAGED, > - PGSTEAL_PROACTIVE, > - PGSCAN_KSWAPD, > - PGSCAN_DIRECT, > - PGSCAN_KHUGEPAGED, > - PGSCAN_PROACTIVE, > PGSCAN_DIRECT_THROTTLE, > - PGSCAN_ANON, > - PGSCAN_FILE, > - PGSTEAL_ANON, > - PGSTEAL_FILE, > #ifdef CONFIG_NUMA > PGSCAN_ZONE_RECLAIM_SUCCESS, > PGSCAN_ZONE_RECLAIM_FAILED, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 86f43b7e5f71..bde0b6536be6 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -328,6 +328,18 @@ static const unsigned int memcg_node_stat_items[] = { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > + PGSTEAL_KSWAPD, > + PGSTEAL_DIRECT, > + PGSTEAL_KHUGEPAGED, > + PGSTEAL_PROACTIVE, > + PGSTEAL_ANON, > + PGSTEAL_FILE, > + PGSCAN_KSWAPD, > + PGSCAN_DIRECT, > + PGSCAN_KHUGEPAGED, > + PGSCAN_PROACTIVE, > + PGSCAN_ANON, > + PGSCAN_FILE, > #ifdef CONFIG_HUGETLB_PAGE > NR_HUGETLB, > #endif > @@ -441,14 +453,6 @@ static const unsigned int memcg_vm_event_stat[] = { > #endif > PSWPIN, > PSWPOUT, > - PGSCAN_KSWAPD, > - PGSCAN_DIRECT, > - PGSCAN_KHUGEPAGED, > - PGSCAN_PROACTIVE, > - PGSTEAL_KSWAPD, > - PGSTEAL_DIRECT, > - PGSTEAL_KHUGEPAGED, > - PGSTEAL_PROACTIVE, > PGFAULT, > PGMAJFAULT, > PGREFILL, > @@ -1496,15 +1500,15 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) > > /* Accumulated memory events */ > seq_buf_printf(s, "pgscan %lu\n", > - memcg_events(memcg, PGSCAN_KSWAPD) + > - memcg_events(memcg, PGSCAN_DIRECT) + > - memcg_events(memcg, PGSCAN_PROACTIVE) + > - memcg_events(memcg, PGSCAN_KHUGEPAGED)); > + memcg_page_state(memcg, PGSCAN_KSWAPD) + > + memcg_page_state(memcg, PGSCAN_DIRECT) + > + memcg_page_state(memcg, PGSCAN_PROACTIVE) + > + memcg_page_state(memcg, PGSCAN_KHUGEPAGED)); > seq_buf_printf(s, "pgsteal %lu\n", > - memcg_events(memcg, PGSTEAL_KSWAPD) + > - memcg_events(memcg, PGSTEAL_DIRECT) + > - memcg_events(memcg, PGSTEAL_PROACTIVE) + > - memcg_events(memcg, PGSTEAL_KHUGEPAGED)); > + memcg_page_state(memcg, PGSTEAL_KSWAPD) + > + memcg_page_state(memcg, PGSTEAL_DIRECT) + > + memcg_page_state(memcg, PGSTEAL_PROACTIVE) + > + memcg_page_state(memcg, PGSTEAL_KHUGEPAGED)); > > for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { > #ifdef CONFIG_MEMCG_V1 > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 614ccf39fe3f..16a0f21e3ea1 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1977,7 +1977,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > unsigned long nr_taken; > struct reclaim_stat stat; > bool file = is_file_lru(lru); > - enum vm_event_item item; > + enum node_stat_item item; > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > bool stalled = false; > > @@ -2003,10 +2003,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); > item = PGSCAN_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, nr_scanned); > - count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); > - __count_vm_events(PGSCAN_ANON + file, nr_scanned); > + 
mod_lruvec_state(lruvec, item, nr_scanned); > + mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned); > > spin_unlock_irq(&lruvec->lru_lock); > > @@ -2023,10 +2021,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > stat.nr_demoted); > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); > item = PGSTEAL_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, nr_reclaimed); > - count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); > - __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); > + mod_lruvec_state(lruvec, item, nr_reclaimed); > + mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); > > lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, > nr_scanned - nr_reclaimed); > @@ -4536,7 +4532,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > { > int i; > int gen; > - enum vm_event_item item; > + enum node_stat_item item; > int sorted = 0; > int scanned = 0; > int isolated = 0; > @@ -4595,13 +4591,11 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > } > > item = PGSCAN_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) { > - __count_vm_events(item, isolated); > + if (!cgroup_reclaim(sc)) > __count_vm_events(PGREFILL, sorted); > - } > - count_memcg_events(memcg, item, isolated); > + mod_lruvec_state(lruvec, item, isolated); > count_memcg_events(memcg, PGREFILL, sorted); > - __count_vm_events(PGSCAN_ANON + type, isolated); > + mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated); > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch, > scanned, skipped, isolated, > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); > @@ -4686,7 +4680,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > LIST_HEAD(clean); > struct folio *folio; > struct folio *next; > - enum vm_event_item item; > + enum node_stat_item item; > struct reclaim_stat stat; > struct lru_gen_mm_walk *walk; > bool skip_retry = false; > @@ -4750,10 +4744,8 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > stat.nr_demoted); > > item = PGSTEAL_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, reclaimed); > - count_memcg_events(memcg, item, reclaimed); > - __count_vm_events(PGSTEAL_ANON + type, reclaimed); > + mod_lruvec_state(lruvec, item, reclaimed); > + mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); > > spin_unlock_irq(&lruvec->lru_lock); > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 74e0ddde1e93..e4b259989d58 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1291,6 +1291,18 @@ const char * const vmstat_text[] = { > [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", > [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", > [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", > + [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", > + [I(PGSTEAL_DIRECT)] = "pgsteal_direct", > + [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", > + [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", > + [I(PGSTEAL_ANON)] = "pgsteal_anon", > + [I(PGSTEAL_FILE)] = "pgsteal_file", > + [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", > + [I(PGSCAN_DIRECT)] = "pgscan_direct", > + [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", > + [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", > + [I(PGSCAN_ANON)] = "pgscan_anon", > + [I(PGSCAN_FILE)] = "pgscan_file", > #ifdef CONFIG_NUMA > [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", > [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", > @@ -1344,19 +1356,7 @@ const char * const 
vmstat_text[] = { > > [I(PGREFILL)] = "pgrefill", > [I(PGREUSE)] = "pgreuse", > - [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", > - [I(PGSTEAL_DIRECT)] = "pgsteal_direct", > - [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", > - [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", > - [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", > - [I(PGSCAN_DIRECT)] = "pgscan_direct", > - [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", > - [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", > [I(PGSCAN_DIRECT_THROTTLE)] = "pgscan_direct_throttle", > - [I(PGSCAN_ANON)] = "pgscan_anon", > - [I(PGSCAN_FILE)] = "pgscan_file", > - [I(PGSTEAL_ANON)] = "pgsteal_anon", > - [I(PGSTEAL_FILE)] = "pgsteal_file", > > #ifdef CONFIG_NUMA > [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success", > -- > 2.47.3 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/2] mm: move pgscan and pgsteal to node stats
  2026-02-12 7:29 ` Michal Hocko
@ 2026-02-12 21:20 ` JP Kobryn
  0 siblings, 0 replies; 27+ messages in thread
From: JP Kobryn @ 2026-02-12 21:20 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On 2/11/26 11:29 PM, Michal Hocko wrote:
> On Wed 11-02-26 20:51:09, JP Kobryn wrote:
>> It would be useful to narrow down reclaim to specific nodes.
>>
>> Provide per-node reclaim visibility by changing the pgscan and pgsteal
>> stats from global vm_event_item's to node_stat_item's. Note this change has
>> the side effect of now tracking these stats on a per-memcg basis.
>
> The changelog could have been more clear about the actual changes as
> this is not overly clear for untrained eyes. The most important parts
> are that /proc/vmstat will preserve reclaim stats with slightly
> different counters ordering (shouldn't break userspace much^W), per-node
> stats will be now newly displayed in /proc/zoneinfo - this is presumably
> the primary motivation to have a better insight of per-node reclaim
> activity, and memcg stats will now show their share of the global memory
> reclaim.
>
> Have I missed anything?

That's accurate. Plus aside from reading /proc/zoneinfo they will also be
in /sys/devices/system/node/nodeN/vmstat. I see I could have been more
explicit about this. Let me make additions to the changelog in v2.

Thanks for taking a look.

^ permalink raw reply	[flat|nested] 27+ messages in thread
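As a concrete illustration of the visibility being discussed above (assuming
the series is applied and using the counter names the patch adds to
vmstat_text), a trivial userspace reader for the per-node file mentioned in
the reply might look like the sketch below; the node0 path is hard-coded
purely for the example:

	/*
	 * Sketch: print the per-node pgscan_* and pgsteal_* counters that
	 * this series exposes in /sys/devices/system/node/nodeN/vmstat.
	 */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/sys/devices/system/node/node0/vmstat", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "pgscan_", 7) ||
			    !strncmp(line, "pgsteal_", 8))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}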
* Re: [PATCH 0/2] improve per-node allocation and reclaim visibility
  2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn
  2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn
  2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn
@ 2026-02-12 4:57 ` Matthew Wilcox
  2026-02-12 21:22 ` JP Kobryn
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2026-02-12 4:57 UTC (permalink / raw)
To: JP Kobryn
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On Wed, Feb 11, 2026 at 08:51:07PM -0800, JP Kobryn wrote:
> We sometimes find ourselves in situations where reclaim kicks in, yet there

who is we? you haven't indicated any affiliation in your tags.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/2] improve per-node allocation and reclaim visibility
  2026-02-12 4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox
@ 2026-02-12 21:22 ` JP Kobryn
  2026-02-12 21:53 ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: JP Kobryn @ 2026-02-12 21:22 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On 2/11/26 8:57 PM, Matthew Wilcox wrote:
> On Wed, Feb 11, 2026 at 08:51:07PM -0800, JP Kobryn wrote:
>> We sometimes find ourselves in situations where reclaim kicks in, yet there
>
> who is we? you haven't indicated any affiliation in your tags.

Meta. Is there a preferred way of indicating this?

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/2] improve per-node allocation and reclaim visibility
  2026-02-12 21:22 ` JP Kobryn
@ 2026-02-12 21:53 ` Matthew Wilcox
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2026-02-12 21:53 UTC (permalink / raw)
To: JP Kobryn
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On Thu, Feb 12, 2026 at 01:22:09PM -0800, JP Kobryn wrote:
> On 2/11/26 8:57 PM, Matthew Wilcox wrote:
> > On Wed, Feb 11, 2026 at 08:51:07PM -0800, JP Kobryn wrote:
> > > We sometimes find ourselves in situations where reclaim kicks in, yet there
> >
> > who is we? you haven't indicated any affiliation in your tags.
>
> Meta. Is there a preferred way of indicating this?

Documentation/process/submitting-patches.rst:

From Line
^^^^^^^^^

The ``from`` line must be the very first line in the message body, and has
the form:

	From: Patch Author <author@example.com>

The ``from`` line specifies who will be credited as the author of the
patch in the permanent changelog. If the ``from`` line is missing, then
the ``From:`` line from the email header will be used to determine the
patch author in the changelog.

The author may indicate their affiliation or the sponsor of the work by
adding the name of an organization to the ``from`` and ``SoB`` lines,
e.g.:

	From: Patch Author (Company) <author@example.com>

I do this with ~/.gitconfig

[user]
	name = Matthew Wilcox (Oracle)
	email = willy@infradead.org

and it goes into the From and Signed-off-by lines correctly when
generating patches.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [syzbot ci] Re: improve per-node allocation and reclaim visibility
  2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn
  ` (2 preceding siblings ...)
  2026-02-12 4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox
@ 2026-02-12 18:08 ` syzbot ci
  3 siblings, 0 replies; 27+ messages in thread
From: syzbot ci @ 2026-02-12 18:08 UTC (permalink / raw)
To: akpm, apopple, axelrasmussen, byungchul, cgroups, david, eperezma,
	gourry, hannes, inwardvessel, jasowang, joshua.hahnjy,
	kernel-team, liam.howlett, linux-kernel, linux-mm,
	lorenzo.stoakes, matthew.brost, mhocko, mst, muchun.song,
	rakie.kim, roman.gushchin, rppt, shakeel.butt, surenb, vbabka,
	virtualization, weixugc, xuanzhuo, ying.huang, yuanchu,
	zhengqi.arch, ziy
Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] improve per-node allocation and reclaim visibility
https://lore.kernel.org/all/20260212045109.255391-1-inwardvessel@gmail.com
* [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy
* [PATCH 2/2] mm: move pgscan and pgsteal to node stats

and found the following issue:
WARNING in __mod_node_page_state

Full report is available here:
https://ci.syzbot.org/series/4ec12ede-3298-43a3-ab6b-79d47759672e

***

WARNING in __mod_node_page_state

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      72a46cdd4ef13690beb8c5a2f6a2023fd7ef2eb4
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/0f678e4c-a4ba-4f17-8ed7-8ae99e56a463/config

------------[ cut here ]------------
IS_ENABLED(CONFIG_PREEMPT_COUNT) && __lockdep_enabled && (preempt_count() == 0 && this_cpu_read(hardirqs_enabled))
WARNING: mm/vmstat.c:396 at __mod_node_page_state+0x126/0x170, CPU#0: kthreadd/2
Modules linked in:
CPU: 0 UID: 0 PID: 2 Comm: kthreadd Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__mod_node_page_state+0x126/0x170
Code: 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 48 89 df 4c 89 e6 44 89 fa e8 68 00 00 00 31 db eb cc 90 0f 0b 90 e9 3e ff ff ff 90 <0f> 0b 90 eb 80 48 c7 c7 e0 c6 64 8e 4c 89 f6 e8 66 3c d3 02 e9 28
RSP: 0000:ffffc900000773d0 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 000000000000003d RDI: ffff88815fffb380
RBP: dffffc0000000000 R08: ffffffff8fef2977 R09: 1ffffffff1fde52e
R10: dffffc0000000000 R11: fffffbfff1fde52f R12: ffff88815fffb380
R13: ffffffff92f50f00 R14: 000000000000003d R15: 000000000000003d
FS:  0000000000000000(0000) GS:ffff88818e0f0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e346000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 alloc_pages_mpol+0x407/0x740
 alloc_pages_noprof+0xa8/0x190
 get_free_pages_noprof+0xf/0x80
 __kasan_populate_vmalloc+0x38/0x1d0
 alloc_vmap_area+0xd21/0x1460
 __get_vm_area_node+0x1f8/0x300
 __vmalloc_node_range_noprof+0x372/0x1730
 __vmalloc_node_noprof+0xc2/0x100
 dup_task_struct+0x228/0x9a0
 copy_process+0x508/0x3980
 kernel_clone+0x248/0x870
 kernel_thread+0x13f/0x1b0
 kthreadd+0x4f9/0x6f0
 ret_from_fork+0x51b/0xa40
 ret_from_fork_asm+0x1a/0x30
 </TASK>

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads:[~2026-02-18 4:25 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-12  4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn
2026-02-12  4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn
2026-02-12  7:29   ` Michal Hocko
2026-02-12 21:22     ` JP Kobryn
2026-02-16  8:26       ` Michal Hocko
2026-02-16 17:50         ` JP Kobryn (Meta)
2026-02-16 21:07           ` Michal Hocko
2026-02-17  7:48             ` JP Kobryn (Meta)
2026-02-17 12:37               ` Michal Hocko
2026-02-17 18:19                 ` JP Kobryn (Meta)
2026-02-17 18:52                   ` Michal Hocko
2026-02-12 15:07   ` Shakeel Butt
2026-02-12 21:23     ` JP Kobryn
2026-02-12 15:24   ` Vlastimil Babka
2026-02-12 21:25     ` JP Kobryn
2026-02-13  8:54       ` Vlastimil Babka
2026-02-13 19:56         ` JP Kobryn (Meta)
2026-02-18  4:25   ` kernel test robot
2026-02-12  4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn
2026-02-12  7:08   ` Michael S. Tsirkin
2026-02-12 21:23     ` JP Kobryn
2026-02-12  7:29   ` Michal Hocko
2026-02-12 21:20     ` JP Kobryn
2026-02-12  4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox
2026-02-12 21:22   ` JP Kobryn
2026-02-12 21:53     ` Matthew Wilcox
2026-02-12 18:08 ` [syzbot ci] " syzbot ci