* [PATCH 0/2] improve per-node allocation and reclaim visibility
@ 2026-02-12 4:51 JP Kobryn
2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn
` (3 more replies)
0 siblings, 4 replies; 27+ messages in thread
From: JP Kobryn @ 2026-02-12 4:51 UTC (permalink / raw)
To: linux-mm
Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david,
eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt,
muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
ying.huang, yuanchu, ziy, kernel-team
We sometimes find ourselves in situations where reclaim kicks in, yet there
is free memory available on the system. One possible explanation is that a
NUMA node under pressure has triggered the reclaim. This NUMA imbalance
scenario could be made easier to diagnose if we had better visibility.
This series aims to provide that visibility by accounting for both the cause
and the effect of the imbalance. First, new node stats track allocations on a
per-policy basis. If a node is under pressure, these stats can help reveal how
it got there.
Second, the stats associated with reclaim are changed from vm_event_item to
node_stat_item. Having the pgsteal and pgscan counters tracked on a
per-node basis reveals the effect of any pressure, and allows us to quickly
narrow down the affected node(s).
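As a rough sketch of the intended readout (assuming the counter names follow
the vmstat_text strings added in these patches, e.g. pgalloc_mpol_bind and
pgsteal_kswapd, and that they are exposed per node via
/sys/devices/system/node/nodeN/vmstat), a small userspace helper could dump
the relevant counters for one node:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[128];
	int nid = argc > 1 ? atoi(argv[1]) : 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/vmstat", nid);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}

	/* Print only the reclaim and mempolicy allocation counters. */
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "pgscan_", 7) ||
		    !strncmp(line, "pgsteal_", 8) ||
		    !strncmp(line, "pgalloc_mpol_", 13))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

Comparing this output across nodes points at the node(s) doing the reclaim and
the policies driving allocations there.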
JP Kobryn (2):
mm/mempolicy: track page allocations per mempolicy
mm: move pgscan and pgsteal to node stats
drivers/virtio/virtio_balloon.c | 8 ++++----
include/linux/mmzone.h | 21 +++++++++++++++++++
include/linux/vm_event_item.h | 12 -----------
mm/memcontrol.c | 36 ++++++++++++++++++---------------
mm/mempolicy.c | 30 +++++++++++++++++++++++++--
mm/vmscan.c | 32 +++++++++++------------------
mm/vmstat.c | 33 +++++++++++++++++++-----------
7 files changed, 106 insertions(+), 66 deletions(-)
--
2.47.3
^ permalink raw reply [flat|nested] 27+ messages in thread* [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn @ 2026-02-12 4:51 ` JP Kobryn 2026-02-12 7:29 ` Michal Hocko ` (3 more replies) 2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn ` (2 subsequent siblings) 3 siblings, 4 replies; 27+ messages in thread From: JP Kobryn @ 2026-02-12 4:51 UTC (permalink / raw) To: linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team It would be useful to see a breakdown of allocations to understand which NUMA policies are driving them. For example, when investigating memory pressure, having policy-specific counts could show that allocations were bound to the affected node (via MPOL_BIND). Add per-policy page allocation counters as new node stat items. These counters can provide correlation between a mempolicy and pressure on a given node. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> --- include/linux/mmzone.h | 9 +++++++++ mm/mempolicy.c | 30 ++++++++++++++++++++++++++++-- mm/vmstat.c | 9 +++++++++ 3 files changed, 46 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fc5d6c88d2f0..762609d5f0af 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -255,6 +255,15 @@ enum node_stat_item { PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, PGDEMOTE_PROACTIVE, +#ifdef CONFIG_NUMA + PGALLOC_MPOL_DEFAULT, + PGALLOC_MPOL_PREFERRED, + PGALLOC_MPOL_BIND, + PGALLOC_MPOL_INTERLEAVE, + PGALLOC_MPOL_LOCAL, + PGALLOC_MPOL_PREFERRED_MANY, + PGALLOC_MPOL_WEIGHTED_INTERLEAVE, +#endif #ifdef CONFIG_HUGETLB_PAGE NR_HUGETLB, #endif diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 68a98ba57882..3c64784af761 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -217,6 +217,21 @@ static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw) new_iw[nid] /= iw_gcd; } +#define CHECK_MPOL_NODE_STAT_OFFSET(mpol) \ + BUILD_BUG_ON(PGALLOC_##mpol - mpol != PGALLOC_MPOL_DEFAULT) + +static enum node_stat_item mpol_node_stat(unsigned short mode) +{ + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_PREFERRED); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_BIND); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_INTERLEAVE); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_LOCAL); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_PREFERRED_MANY); + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_WEIGHTED_INTERLEAVE); + + return PGALLOC_MPOL_DEFAULT + mode; +} + int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) { struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; @@ -2446,8 +2461,14 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, nodemask = policy_nodemask(gfp, pol, ilx, &nid); - if (pol->mode == MPOL_PREFERRED_MANY) - return alloc_pages_preferred_many(gfp, order, nid, nodemask); + if (pol->mode == MPOL_PREFERRED_MANY) { + page = alloc_pages_preferred_many(gfp, order, nid, nodemask); + if (page) + __mod_node_page_state(page_pgdat(page), + mpol_node_stat(MPOL_PREFERRED_MANY), 1 << order); + + return page; + } if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && /* filter "hugepage" allocation, unless from alloc_pages() 
*/ @@ -2472,6 +2493,9 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, page = __alloc_frozen_pages_noprof( gfp | __GFP_THISNODE | __GFP_NORETRY, order, nid, NULL); + if (page) + __mod_node_page_state(page_pgdat(page), + mpol_node_stat(pol->mode), 1 << order); if (page || !(gfp & __GFP_DIRECT_RECLAIM)) return page; /* @@ -2484,6 +2508,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, } page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask); + if (page) + __mod_node_page_state(page_pgdat(page), mpol_node_stat(pol->mode), 1 << order); if (unlikely(pol->mode == MPOL_INTERLEAVE || pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) { diff --git a/mm/vmstat.c b/mm/vmstat.c index 65de88cdf40e..74e0ddde1e93 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1291,6 +1291,15 @@ const char * const vmstat_text[] = { [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", +#ifdef CONFIG_NUMA + [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", + [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", + [I(PGALLOC_MPOL_BIND)] = "pgalloc_mpol_bind", + [I(PGALLOC_MPOL_INTERLEAVE)] = "pgalloc_mpol_interleave", + [I(PGALLOC_MPOL_LOCAL)] = "pgalloc_mpol_local", + [I(PGALLOC_MPOL_PREFERRED_MANY)] = "pgalloc_mpol_preferred_many", + [I(PGALLOC_MPOL_WEIGHTED_INTERLEAVE)] = "pgalloc_mpol_weighted_interleave", +#endif #ifdef CONFIG_HUGETLB_PAGE [I(NR_HUGETLB)] = "nr_hugetlb", #endif -- 2.47.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
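For reference, the index mapping in mpol_node_stat() only works because the
PGALLOC_MPOL_* stat items are declared in the same order as the MPOL_* modes;
the CHECK_MPOL_NODE_STAT_OFFSET() checks enforce that at build time. A
standalone sketch of the same trick, with illustrative enum values rather than
the kernel's:

#include <assert.h>

enum mode { MODE_DEFAULT, MODE_PREFERRED, MODE_BIND };
enum stat { STAT_OTHER, STAT_MODE_DEFAULT, STAT_MODE_PREFERRED, STAT_MODE_BIND };

static enum stat mode_stat(enum mode m)
{
	/* Compile-time guard: the stat items must mirror the mode order. */
	_Static_assert(STAT_MODE_BIND - MODE_BIND == STAT_MODE_DEFAULT,
		       "stat items must mirror mode order");
	return STAT_MODE_DEFAULT + m;
}

int main(void)
{
	assert(mode_stat(MODE_BIND) == STAT_MODE_BIND);
	return 0;
}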
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn @ 2026-02-12 7:29 ` Michal Hocko 2026-02-12 21:22 ` JP Kobryn 2026-02-12 15:07 ` Shakeel Butt ` (2 subsequent siblings) 3 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-12 7:29 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Wed 11-02-26 20:51:08, JP Kobryn wrote: > It would be useful to see a breakdown of allocations to understand which > NUMA policies are driving them. For example, when investigating memory > pressure, having policy-specific counts could show that allocations were > bound to the affected node (via MPOL_BIND). > > Add per-policy page allocation counters as new node stat items. These > counters can provide correlation between a mempolicy and pressure on a > given node. Could you be more specific how exactly do you plan to use those counters? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 7:29 ` Michal Hocko @ 2026-02-12 21:22 ` JP Kobryn 2026-02-16 8:26 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn @ 2026-02-12 21:22 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/11/26 11:29 PM, Michal Hocko wrote: > On Wed 11-02-26 20:51:08, JP Kobryn wrote: >> It would be useful to see a breakdown of allocations to understand which >> NUMA policies are driving them. For example, when investigating memory >> pressure, having policy-specific counts could show that allocations were >> bound to the affected node (via MPOL_BIND). >> >> Add per-policy page allocation counters as new node stat items. These >> counters can provide correlation between a mempolicy and pressure on a >> given node. > > Could you be more specific how exactly do you plan to use those > counters? Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once we identify the affected node(s), the new mpol counters (this patch) allow us correlate the pressure to the mempolicy driving it. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 21:22 ` JP Kobryn @ 2026-02-16 8:26 ` Michal Hocko 2026-02-16 17:50 ` JP Kobryn (Meta) 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-16 8:26 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Thu 12-02-26 13:22:56, JP Kobryn wrote: > On 2/11/26 11:29 PM, Michal Hocko wrote: > > On Wed 11-02-26 20:51:08, JP Kobryn wrote: > > > It would be useful to see a breakdown of allocations to understand which > > > NUMA policies are driving them. For example, when investigating memory > > > pressure, having policy-specific counts could show that allocations were > > > bound to the affected node (via MPOL_BIND). > > > > > > Add per-policy page allocation counters as new node stat items. These > > > counters can provide correlation between a mempolicy and pressure on a > > > given node. > > > > Could you be more specific how exactly do you plan to use those > > counters? > > Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once > we identify the affected node(s), the new mpol counters (this patch) > allow us correlate the pressure to the mempolicy driving it. I would appreciate somehow more specificity. You are adding counters that are not really easy to drop once they are in. Sure we have precedence of dropping some counters in the past so this is not as hard as usual userspace APIs but still... How exactly do you tolerate mempolicy allocations to specific nodes? While MPOL_MBIND is quite straightforward others are less so. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-16 8:26 ` Michal Hocko @ 2026-02-16 17:50 ` JP Kobryn (Meta) 2026-02-16 21:07 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn (Meta) @ 2026-02-16 17:50 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/16/26 12:26 AM, Michal Hocko wrote: > On Thu 12-02-26 13:22:56, JP Kobryn wrote: >> On 2/11/26 11:29 PM, Michal Hocko wrote: >>> On Wed 11-02-26 20:51:08, JP Kobryn wrote: >>>> It would be useful to see a breakdown of allocations to understand which >>>> NUMA policies are driving them. For example, when investigating memory >>>> pressure, having policy-specific counts could show that allocations were >>>> bound to the affected node (via MPOL_BIND). >>>> >>>> Add per-policy page allocation counters as new node stat items. These >>>> counters can provide correlation between a mempolicy and pressure on a >>>> given node. >>> >>> Could you be more specific how exactly do you plan to use those >>> counters? >> >> Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once >> we identify the affected node(s), the new mpol counters (this patch) >> allow us correlate the pressure to the mempolicy driving it. > > I would appreciate somehow more specificity. You are adding counters > that are not really easy to drop once they are in. Sure we have > precedence of dropping some counters in the past so this is not as hard > as usual userspace APIs but still... > > How exactly do you tolerate mempolicy allocations to specific nodes? > While MPOL_MBIND is quite straightforward others are less so. The design does account for this regardless of the policy. In the call to __mod_node_page_state(), I'm using page_pgdat(page) so the stat is attributed to the node where the page actually landed. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-16 17:50 ` JP Kobryn (Meta) @ 2026-02-16 21:07 ` Michal Hocko 2026-02-17 7:48 ` JP Kobryn (Meta) 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-16 21:07 UTC (permalink / raw) To: JP Kobryn (Meta) Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Mon 16-02-26 09:50:26, JP Kobryn (Meta) wrote: > On 2/16/26 12:26 AM, Michal Hocko wrote: > > On Thu 12-02-26 13:22:56, JP Kobryn wrote: > > > On 2/11/26 11:29 PM, Michal Hocko wrote: > > > > On Wed 11-02-26 20:51:08, JP Kobryn wrote: > > > > > It would be useful to see a breakdown of allocations to understand which > > > > > NUMA policies are driving them. For example, when investigating memory > > > > > pressure, having policy-specific counts could show that allocations were > > > > > bound to the affected node (via MPOL_BIND). > > > > > > > > > > Add per-policy page allocation counters as new node stat items. These > > > > > counters can provide correlation between a mempolicy and pressure on a > > > > > given node. > > > > > > > > Could you be more specific how exactly do you plan to use those > > > > counters? > > > > > > Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once > > > we identify the affected node(s), the new mpol counters (this patch) > > > allow us correlate the pressure to the mempolicy driving it. > > > > I would appreciate somehow more specificity. You are adding counters > > that are not really easy to drop once they are in. Sure we have > > precedence of dropping some counters in the past so this is not as hard > > as usual userspace APIs but still... > > > > How exactly do you tolerate mempolicy allocations to specific nodes? > > While MPOL_MBIND is quite straightforward others are less so. > > The design does account for this regardless of the policy. In the call > to __mod_node_page_state(), I'm using page_pgdat(page) so the stat is > attributed to the node where the page actually landed. That much is clear[*]. The consumer side of things is not really clear to me. How do you know which policy or part of the nodemask of that policy is the source of the memory pressure on a particular node? In other words how much is the data actually useful except for a single node mempolicy (i.e. MBIND). [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the target node even when the allocation is from a remote node from the "local" POV. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-16 21:07 ` Michal Hocko @ 2026-02-17 7:48 ` JP Kobryn (Meta) 2026-02-17 12:37 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn (Meta) @ 2026-02-17 7:48 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/16/26 1:07 PM, Michal Hocko wrote: > On Mon 16-02-26 09:50:26, JP Kobryn (Meta) wrote: >> On 2/16/26 12:26 AM, Michal Hocko wrote: >>> On Thu 12-02-26 13:22:56, JP Kobryn wrote: >>>> On 2/11/26 11:29 PM, Michal Hocko wrote: >>>>> On Wed 11-02-26 20:51:08, JP Kobryn wrote: >>>>>> It would be useful to see a breakdown of allocations to understand which >>>>>> NUMA policies are driving them. For example, when investigating memory >>>>>> pressure, having policy-specific counts could show that allocations were >>>>>> bound to the affected node (via MPOL_BIND). >>>>>> >>>>>> Add per-policy page allocation counters as new node stat items. These >>>>>> counters can provide correlation between a mempolicy and pressure on a >>>>>> given node. >>>>> >>>>> Could you be more specific how exactly do you plan to use those >>>>> counters? >>>> >>>> Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once >>>> we identify the affected node(s), the new mpol counters (this patch) >>>> allow us correlate the pressure to the mempolicy driving it. >>> >>> I would appreciate somehow more specificity. You are adding counters >>> that are not really easy to drop once they are in. Sure we have >>> precedence of dropping some counters in the past so this is not as hard >>> as usual userspace APIs but still... >>> >>> How exactly do you tolerate mempolicy allocations to specific nodes? >>> While MPOL_MBIND is quite straightforward others are less so. >> >> The design does account for this regardless of the policy. In the call >> to __mod_node_page_state(), I'm using page_pgdat(page) so the stat is >> attributed to the node where the page actually landed. > > That much is clear[*]. The consumer side of things is not really clear to > me. How do you know which policy or part of the nodemask of that policy > is the source of the memory pressure on a particular node? In other > words how much is the data actually useful except for a single node > mempolicy (i.e. MBIND). Other than the bind policy, having the interleave (and weighted) stats would allow us to see the effective distribution of the policy. Pressure could be linked to a user configured weight scheme. I would think it could also help with confirming expected distributions. You brought up the node mask so with the preferred policy, I think this is a good one for using the counters as well. Once we're at the point where we know the node(s) under pressure and then see significant preferred allocs accounted for, we could search the numa_maps that have "prefer:<node>" to find the tasks targeting the affected nodes. I mentioned this on another thread in this series but I'll include here as well and expand some more. For any given policy, the workflow would be: 1) Pressure/OOMs reported while system-wide memory is free. 
2) Check per-node pgscan/pgsteal stats (provided by patch 2) to narrow down node(s) under pressure. They become available in /sys/devices/system/node/nodeN/vmstat. 3) Check per-policy allocation counters (this patch) on that node to find what policy was driving it. Same readout at nodeN/vmstat. 4) Now use /proc/*/numa_maps to identify tasks using the policy. > > [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the > target node even when the allocation is from a remote node from the > "local" POV. It's a good point. The accounting as a result of fallback cases shouldn't detract from an investigation though. We're interested in the node(s) under pressure so the relatively few fallback allocations would land on nodes that are not under pressure and could be viewed as acceptable noise. ^ permalink raw reply [flat|nested] 27+ messages in thread
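As a sketch of step 4 above (not part of the series), tasks whose policy
targets an affected node can be found by scanning /proc/<pid>/numa_maps for
policy strings such as "bind:<node>" or "prefer:<node>"; the exact strings are
whatever numa_maps already prints, and the matching below is deliberately
crude (it does not distinguish node 1 from node 10, for example):

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *node = argc > 1 ? argv[1] : "0";
	char needles[2][32], path[64], line[512];
	struct dirent *de;
	DIR *proc = opendir("/proc");

	if (!proc)
		return 1;
	snprintf(needles[0], sizeof(needles[0]), "bind:%s", node);
	snprintf(needles[1], sizeof(needles[1]), "prefer:%s", node);

	while ((de = readdir(proc))) {
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;
		snprintf(path, sizeof(path), "/proc/%s/numa_maps", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		/* Report the first mapping that names the node of interest. */
		while (fgets(line, sizeof(line), f)) {
			if (strstr(line, needles[0]) ||
			    strstr(line, needles[1])) {
				printf("pid %s: %s", de->d_name, line);
				break;
			}
		}
		fclose(f);
	}
	closedir(proc);
	return 0;
}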
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-17 7:48 ` JP Kobryn (Meta) @ 2026-02-17 12:37 ` Michal Hocko 2026-02-17 18:19 ` JP Kobryn (Meta) 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-17 12:37 UTC (permalink / raw) To: JP Kobryn (Meta) Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Mon 16-02-26 23:48:42, JP Kobryn (Meta) wrote: > On 2/16/26 1:07 PM, Michal Hocko wrote: [...] > > [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the > > target node even when the allocation is from a remote node from the > > "local" POV. > > It's a good point. The accounting as a result of fallback cases > shouldn't detract from an investigation though. We're interested in the > node(s) under pressure so the relatively few fallback allocations would > land on nodes that are not under pressure and could be viewed as > acceptable noise. This is really confusing. You simply have no means to tell the difference between the requested node and the real node used so you cannot really say whether the memory pressure is because of fallbacks or your mempolicy configurations. That means that you cannot tell the difference between the source of the pressure and victim of that pressure. I am not saying these scheme doesn't work in your particular setup but I do not see this is long term maintainable thing. It is just too easy to get misleading numbers. If we want/need to track mempolicy allocations better than what existing numa_* counters offer then this needs to be thought through I believe. I do not think we should add these counters in this form. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-17 12:37 ` Michal Hocko @ 2026-02-17 18:19 ` JP Kobryn (Meta) 2026-02-17 18:52 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn (Meta) @ 2026-02-17 18:19 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/17/26 4:37 AM, Michal Hocko wrote: > On Mon 16-02-26 23:48:42, JP Kobryn (Meta) wrote: >> On 2/16/26 1:07 PM, Michal Hocko wrote: > [...] >>> [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the >>> target node even when the allocation is from a remote node from the >>> "local" POV. >> >> It's a good point. The accounting as a result of fallback cases >> shouldn't detract from an investigation though. We're interested in the >> node(s) under pressure so the relatively few fallback allocations would >> land on nodes that are not under pressure and could be viewed as >> acceptable noise. > > This is really confusing. You simply have no means to tell the > difference between the requested node and the real node used so you > cannot really say whether the memory pressure is because of fallbacks or > your mempolicy configurations. That means that you cannot tell the > difference between the source of the pressure and victim of that > pressure. What if I excluded the fallback cases? I could get the actual node from the allocated page and compare against the requested node or node mask. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-17 18:19 ` JP Kobryn (Meta) @ 2026-02-17 18:52 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2026-02-17 18:52 UTC (permalink / raw) To: JP Kobryn (Meta) Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Tue 17-02-26 10:19:08, JP Kobryn (Meta) wrote: > On 2/17/26 4:37 AM, Michal Hocko wrote: > > On Mon 16-02-26 23:48:42, JP Kobryn (Meta) wrote: > > > On 2/16/26 1:07 PM, Michal Hocko wrote: > > [...] > > > > [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the > > > > target node even when the allocation is from a remote node from the > > > > "local" POV. > > > > > > It's a good point. The accounting as a result of fallback cases > > > shouldn't detract from an investigation though. We're interested in the > > > node(s) under pressure so the relatively few fallback allocations would > > > land on nodes that are not under pressure and could be viewed as > > > acceptable noise. > > > > This is really confusing. You simply have no means to tell the > > difference between the requested node and the real node used so you > > cannot really say whether the memory pressure is because of fallbacks or > > your mempolicy configurations. That means that you cannot tell the > > difference between the source of the pressure and victim of that > > pressure. > > What if I excluded the fallback cases? I could get the actual node from > the allocated page and compare against the requested node or node mask. I think it would make sense to send the per-node reclaim stats separately as there doesn't seem to be any dispute about that. For mempolicy stats try to define semantic for each mempolicy first. What exactly do you miss from existing numa_*? Do you want to count number of requests/successes. Do you want to track failures? In what kind of granularity (track fallback nodes)? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn 2026-02-12 7:29 ` Michal Hocko @ 2026-02-12 15:07 ` Shakeel Butt 2026-02-12 21:23 ` JP Kobryn 2026-02-12 15:24 ` Vlastimil Babka 2026-02-18 4:25 ` kernel test robot 3 siblings, 1 reply; 27+ messages in thread From: Shakeel Butt @ 2026-02-12 15:07 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Wed, Feb 11, 2026 at 08:51:08PM -0800, JP Kobryn wrote: > It would be useful to see a breakdown of allocations to understand which > NUMA policies are driving them. For example, when investigating memory > pressure, having policy-specific counts could show that allocations were > bound to the affected node (via MPOL_BIND). > > Add per-policy page allocation counters as new node stat items. These > counters can provide correlation between a mempolicy and pressure on a > given node. > > Signed-off-by: JP Kobryn <inwardvessel@gmail.com> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [...] > int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) > { > struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; > @@ -2446,8 +2461,14 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > > nodemask = policy_nodemask(gfp, pol, ilx, &nid); > > - if (pol->mode == MPOL_PREFERRED_MANY) > - return alloc_pages_preferred_many(gfp, order, nid, nodemask); > + if (pol->mode == MPOL_PREFERRED_MANY) { > + page = alloc_pages_preferred_many(gfp, order, nid, nodemask); > + if (page) > + __mod_node_page_state(page_pgdat(page), > + mpol_node_stat(MPOL_PREFERRED_MANY), 1 << order); Here and two places below, please use mod_node_page_state() instead of __mod_node_page_state() as __foo() requires preempt disable or if the given stat can be updated in IRQ, then IRQ disable. This code path does not do either of that. > + > + return page; > + } > > if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && > /* filter "hugepage" allocation, unless from alloc_pages() */ > @@ -2472,6 +2493,9 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > page = __alloc_frozen_pages_noprof( > gfp | __GFP_THISNODE | __GFP_NORETRY, order, > nid, NULL); > + if (page) > + __mod_node_page_state(page_pgdat(page), > + mpol_node_stat(pol->mode), 1 << order); > if (page || !(gfp & __GFP_DIRECT_RECLAIM)) > return page; > /* > @@ -2484,6 +2508,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > } > > page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask); > + if (page) > + __mod_node_page_state(page_pgdat(page), mpol_node_stat(pol->mode), 1 << order); > ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 15:07 ` Shakeel Butt @ 2026-02-12 21:23 ` JP Kobryn 0 siblings, 0 replies; 27+ messages in thread From: JP Kobryn @ 2026-02-12 21:23 UTC (permalink / raw) To: Shakeel Butt Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/12/26 7:07 AM, Shakeel Butt wrote: > On Wed, Feb 11, 2026 at 08:51:08PM -0800, JP Kobryn wrote: >> It would be useful to see a breakdown of allocations to understand which >> NUMA policies are driving them. For example, when investigating memory >> pressure, having policy-specific counts could show that allocations were >> bound to the affected node (via MPOL_BIND). >> >> Add per-policy page allocation counters as new node stat items. These >> counters can provide correlation between a mempolicy and pressure on a >> given node. >> >> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> >> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > [...] > >> int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) >> { >> struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; >> @@ -2446,8 +2461,14 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, >> >> nodemask = policy_nodemask(gfp, pol, ilx, &nid); >> >> - if (pol->mode == MPOL_PREFERRED_MANY) >> - return alloc_pages_preferred_many(gfp, order, nid, nodemask); >> + if (pol->mode == MPOL_PREFERRED_MANY) { >> + page = alloc_pages_preferred_many(gfp, order, nid, nodemask); >> + if (page) >> + __mod_node_page_state(page_pgdat(page), >> + mpol_node_stat(MPOL_PREFERRED_MANY), 1 << order); > > Here and two places below, please use mod_node_page_state() instead of > __mod_node_page_state() as __foo() requires preempt disable or if the > given stat can be updated in IRQ, then IRQ disable. This code path does > not do either of that. Thanks, I also see syzbot flagged this as well. I can make this change in v2. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn 2026-02-12 7:29 ` Michal Hocko 2026-02-12 15:07 ` Shakeel Butt @ 2026-02-12 15:24 ` Vlastimil Babka 2026-02-12 21:25 ` JP Kobryn 2026-02-18 4:25 ` kernel test robot 3 siblings, 1 reply; 27+ messages in thread From: Vlastimil Babka @ 2026-02-12 15:24 UTC (permalink / raw) To: JP Kobryn, linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/12/26 05:51, JP Kobryn wrote: > It would be useful to see a breakdown of allocations to understand which > NUMA policies are driving them. For example, when investigating memory > pressure, having policy-specific counts could show that allocations were > bound to the affected node (via MPOL_BIND). > > Add per-policy page allocation counters as new node stat items. These > counters can provide correlation between a mempolicy and pressure on a > given node. > > Signed-off-by: JP Kobryn <inwardvessel@gmail.com> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended in a way that would capture any missing important details? A counter per policy type seems exhaustive, but then on one hand it might be not important to distinguish beetween some of them, and on the other hand it doesn't track the nodemask anyway. > --- > include/linux/mmzone.h | 9 +++++++++ > mm/mempolicy.c | 30 ++++++++++++++++++++++++++++-- > mm/vmstat.c | 9 +++++++++ > 3 files changed, 46 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index fc5d6c88d2f0..762609d5f0af 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -255,6 +255,15 @@ enum node_stat_item { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > +#ifdef CONFIG_NUMA > + PGALLOC_MPOL_DEFAULT, > + PGALLOC_MPOL_PREFERRED, > + PGALLOC_MPOL_BIND, > + PGALLOC_MPOL_INTERLEAVE, > + PGALLOC_MPOL_LOCAL, > + PGALLOC_MPOL_PREFERRED_MANY, > + PGALLOC_MPOL_WEIGHTED_INTERLEAVE, > +#endif > #ifdef CONFIG_HUGETLB_PAGE > NR_HUGETLB, > #endif > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 68a98ba57882..3c64784af761 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -217,6 +217,21 @@ static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw) > new_iw[nid] /= iw_gcd; > } > > +#define CHECK_MPOL_NODE_STAT_OFFSET(mpol) \ > + BUILD_BUG_ON(PGALLOC_##mpol - mpol != PGALLOC_MPOL_DEFAULT) > + > +static enum node_stat_item mpol_node_stat(unsigned short mode) > +{ > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_PREFERRED); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_BIND); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_INTERLEAVE); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_LOCAL); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_PREFERRED_MANY); > + CHECK_MPOL_NODE_STAT_OFFSET(MPOL_WEIGHTED_INTERLEAVE); > + > + return PGALLOC_MPOL_DEFAULT + mode; > +} > + > int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) > { > struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; > @@ -2446,8 +2461,14 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > > nodemask = policy_nodemask(gfp, pol, ilx, &nid); > > - if (pol->mode 
== MPOL_PREFERRED_MANY) > - return alloc_pages_preferred_many(gfp, order, nid, nodemask); > + if (pol->mode == MPOL_PREFERRED_MANY) { > + page = alloc_pages_preferred_many(gfp, order, nid, nodemask); > + if (page) > + __mod_node_page_state(page_pgdat(page), > + mpol_node_stat(MPOL_PREFERRED_MANY), 1 << order); > + > + return page; > + } > > if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && > /* filter "hugepage" allocation, unless from alloc_pages() */ > @@ -2472,6 +2493,9 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > page = __alloc_frozen_pages_noprof( > gfp | __GFP_THISNODE | __GFP_NORETRY, order, > nid, NULL); > + if (page) > + __mod_node_page_state(page_pgdat(page), > + mpol_node_stat(pol->mode), 1 << order); > if (page || !(gfp & __GFP_DIRECT_RECLAIM)) > return page; > /* > @@ -2484,6 +2508,8 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > } > > page = __alloc_frozen_pages_noprof(gfp, order, nid, nodemask); > + if (page) > + __mod_node_page_state(page_pgdat(page), mpol_node_stat(pol->mode), 1 << order); > > if (unlikely(pol->mode == MPOL_INTERLEAVE || > pol->mode == MPOL_WEIGHTED_INTERLEAVE) && page) { > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 65de88cdf40e..74e0ddde1e93 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1291,6 +1291,15 @@ const char * const vmstat_text[] = { > [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", > [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", > [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", > +#ifdef CONFIG_NUMA > + [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", > + [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", > + [I(PGALLOC_MPOL_BIND)] = "pgalloc_mpol_bind", > + [I(PGALLOC_MPOL_INTERLEAVE)] = "pgalloc_mpol_interleave", > + [I(PGALLOC_MPOL_LOCAL)] = "pgalloc_mpol_local", > + [I(PGALLOC_MPOL_PREFERRED_MANY)] = "pgalloc_mpol_preferred_many", > + [I(PGALLOC_MPOL_WEIGHTED_INTERLEAVE)] = "pgalloc_mpol_weighted_interleave", > +#endif > #ifdef CONFIG_HUGETLB_PAGE > [I(NR_HUGETLB)] = "nr_hugetlb", > #endif ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 15:24 ` Vlastimil Babka @ 2026-02-12 21:25 ` JP Kobryn 2026-02-13 8:54 ` Vlastimil Babka 0 siblings, 1 reply; 27+ messages in thread From: JP Kobryn @ 2026-02-12 21:25 UTC (permalink / raw) To: Vlastimil Babka, linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/12/26 7:24 AM, Vlastimil Babka wrote: > On 2/12/26 05:51, JP Kobryn wrote: >> It would be useful to see a breakdown of allocations to understand which >> NUMA policies are driving them. For example, when investigating memory >> pressure, having policy-specific counts could show that allocations were >> bound to the affected node (via MPOL_BIND). >> >> Add per-policy page allocation counters as new node stat items. These >> counters can provide correlation between a mempolicy and pressure on a >> given node. >> >> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> >> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended > in a way that would capture any missing important details? A counter per > policy type seems exhaustive, but then on one hand it might be not important > to distinguish beetween some of them, and on the other hand it doesn't track > the nodemask anyway. The two patches of the series should complement each other. When investigating memory pressure, we could identify the affected nodes (patch 2). Then we can cross-reference the policy-specific stats to find any correlation (this patch). I think extending numa_* counters would call for more permutations to account for the numa stat per policy. I think distinguishing between MPOL_DEFAULT and MPOL_BIND is meaningful, for example. Am I understanding your question? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 21:25 ` JP Kobryn @ 2026-02-13 8:54 ` Vlastimil Babka 2026-02-13 19:56 ` JP Kobryn (Meta) 0 siblings, 1 reply; 27+ messages in thread From: Vlastimil Babka @ 2026-02-13 8:54 UTC (permalink / raw) To: JP Kobryn, linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/12/26 22:25, JP Kobryn wrote: > On 2/12/26 7:24 AM, Vlastimil Babka wrote: >> On 2/12/26 05:51, JP Kobryn wrote: >>> It would be useful to see a breakdown of allocations to understand which >>> NUMA policies are driving them. For example, when investigating memory >>> pressure, having policy-specific counts could show that allocations were >>> bound to the affected node (via MPOL_BIND). >>> >>> Add per-policy page allocation counters as new node stat items. These >>> counters can provide correlation between a mempolicy and pressure on a >>> given node. >>> >>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> >>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> >> >> Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended >> in a way that would capture any missing important details? A counter per >> policy type seems exhaustive, but then on one hand it might be not important >> to distinguish beetween some of them, and on the other hand it doesn't track >> the nodemask anyway. > > The two patches of the series should complement each other. When > investigating memory pressure, we could identify the affected nodes > (patch 2). Then we can cross-reference the policy-specific stats to find > any correlation (this patch). > > I think extending numa_* counters would call for more permutations to > account for the numa stat per policy. I think distinguishing between > MPOL_DEFAULT and MPOL_BIND is meaningful, for example. Am I Are there other useful examples or would it be enough to add e.g. a numa_bind counter to the numa_hit/miss/etc? What I'm trying to say the level of detail you are trying to add to the always-on counters seems like more suitable for tracepoints. The counters should be limited to what's known to be useful and not "everything we are able to track and possibly could need one day". > understanding your question? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-13 8:54 ` Vlastimil Babka @ 2026-02-13 19:56 ` JP Kobryn (Meta) 0 siblings, 0 replies; 27+ messages in thread From: JP Kobryn (Meta) @ 2026-02-13 19:56 UTC (permalink / raw) To: Vlastimil Babka, linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On 2/13/26 12:54 AM, Vlastimil Babka wrote: > On 2/12/26 22:25, JP Kobryn wrote: >> On 2/12/26 7:24 AM, Vlastimil Babka wrote: >>> On 2/12/26 05:51, JP Kobryn wrote: >>>> It would be useful to see a breakdown of allocations to understand which >>>> NUMA policies are driving them. For example, when investigating memory >>>> pressure, having policy-specific counts could show that allocations were >>>> bound to the affected node (via MPOL_BIND). >>>> >>>> Add per-policy page allocation counters as new node stat items. These >>>> counters can provide correlation between a mempolicy and pressure on a >>>> given node. >>>> >>>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> >>>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> >>> >>> Are the numa_{hit,miss,etc.} counters insufficient? Could they be extended >>> in a way that would capture any missing important details? A counter per >>> policy type seems exhaustive, but then on one hand it might be not important >>> to distinguish beetween some of them, and on the other hand it doesn't track >>> the nodemask anyway. >> >> The two patches of the series should complement each other. When >> investigating memory pressure, we could identify the affected nodes >> (patch 2). Then we can cross-reference the policy-specific stats to find >> any correlation (this patch). >> >> I think extending numa_* counters would call for more permutations to >> account for the numa stat per policy. I think distinguishing between >> MPOL_DEFAULT and MPOL_BIND is meaningful, for example. Am I > > Are there other useful examples or would it be enough to add e.g. a > numa_bind counter to the numa_hit/miss/etc? Aside from bind, it's worth emphasizing that with default policy tracking we could see if the local node is the source of pressure. In the interleave case, we would be able to see if the loads are being balanced or, in the weighted case, being distributed properly. On extending the numa stats instead, I looked into this some more. I'm not sure if they're a good fit. They seem more about whether the allocator succeeded at placement rather than which policy drove the allocation. Thoughts? > What I'm trying to say the level of detail you are trying to add to the > always-on counters seems like more suitable for tracepoints. The counters > should be limited to what's known to be useful and not "everything we are > able to track and possibly could need one day". In a triage scenario, having the stats collected up to the time of the reported issue would be better. We make use of the tool called below[0]. It periodically samples the system and allows us to view the historical state prior to the issue. If we started at the time of the incident and attached tracepoints it would be too late. The triage workflow would look like this: 1) Pressure/OOMs reported while system-wide memory is free. 
2) Check per-node pgscan/pgsteal stats (provided by patch 2) to narrow down node(s) under pressure. 3) Check per-policy allocation counters (this patch) on that node to find what policy was driving it. [0] https://github.com/facebookincubator/below ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn ` (2 preceding siblings ...) 2026-02-12 15:24 ` Vlastimil Babka @ 2026-02-18 4:25 ` kernel test robot 3 siblings, 0 replies; 27+ messages in thread From: kernel test robot @ 2026-02-18 4:25 UTC (permalink / raw) To: JP Kobryn Cc: oe-lkp, lkp, Johannes Weiner, linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team, oliver.sang Hello, kernel test robot noticed "RIP:__mod_node_page_state" on: commit: 4b5f69459c0988d3b292aceb74633e04eea84c7f ("[PATCH 1/2] mm/mempolicy: track page allocations per mempolicy") url: https://github.com/intel-lab-lkp/linux/commits/JP-Kobryn/mm-mempolicy-track-page-allocations-per-mempolicy/20260212-142941 base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/all/20260212045109.255391-2-inwardvessel@gmail.com/ patch subject: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy in testcase: boot config: x86_64-randconfig-007-20250327 compiler: gcc-14 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G (please refer to attached dmesg/kmsg for entire log/backtrace) +------------------------------------------------------------------+------------+------------+ | | 5cbf93e36f | 4b5f69459c | +------------------------------------------------------------------+------------+------------+ | boot_successes | 244 | 0 | | boot_failures | 0 | 244 | | RIP:__mod_node_page_state | 0 | 244 | | BUG:using__this_cpu_read()in_preemptible | 0 | 244 | | BUG:using__this_cpu_write()in_preemptible[#]code:kthreadd | 0 | 244 | | BUG:using__this_cpu_write()in_preemptible[#]code:swapper | 0 | 187 | | BUG:using__this_cpu_write()in_preemptible[#]code:kdevtmpfs | 0 | 79 | | BUG:using__this_cpu_write()in_preemptible[#]code:kworker/u8 | 0 | 229 | | BUG:using__this_cpu_write()in_preemptible[#]code:udevd | 0 | 62 | | BUG:using__this_cpu_write()in_preemptible[#]code:tail | 0 | 21 | | BUG:using__this_cpu_write()in_preemptible[#]code:syslogd | 0 | 54 | | BUG:using__this_cpu_write()in_preemptible[#]code:klogd | 0 | 113 | | BUG:using__this_cpu_write()in_preemptible[#]code:sleep | 0 | 98 | | BUG:using__this_cpu_write()in_preemptible[#]code:post-run | 0 | 39 | | BUG:using__this_cpu_write()in_preemptible[#]code:rsync | 0 | 9 | | BUG:using__this_cpu_write()in_preemptible[#]code:modprobe | 0 | 6 | | BUG:using__this_cpu_write()in_preemptible[#]code | 0 | 32 | | BUG:using__this_cpu_write()in_preemptible[#]code:udevadm | 0 | 78 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd | 0 | 39 | | BUG:using__this_cpu_write()in_preemptible[#]code:(udev-worker) | 0 | 53 | | RIP:rep_movs_alternative | 0 | 5 | | BUG:using__this_cpu_write()in_preemptible[#]code:cat | 0 | 7 | | BUG:using__this_cpu_write()in_preemptible[#]code:sed | 0 | 98 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-udevd | 0 | 19 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-journal | 0 | 54 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-random | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:journalctl | 0 | 8 | | 
BUG:using__this_cpu_write()in_preemptible[#]code:start_getty | 0 | 4 | | RIP:__put_user_4 | 0 | 24 | | BUG:using__this_cpu_write()in_preemptible[#]code:wget | 0 | 82 | | BUG:using__this_cpu_write()in_preemptible[#]code:run-lkp | 0 | 32 | | BUG:using__this_cpu_write()in_preemptible[#]code:boot-#-yocto-i3 | 0 | 24 | | BUG:using__this_cpu_write()in_preemptible[#]code:one-shot-monito | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:vmstat | 0 | 29 | | BUG:using__this_cpu_write()in_preemptible[#]code:rs:main_Q:Reg | 0 | 9 | | RIP:rep_stos_alternative | 0 | 11 | | BUG:using__this_cpu_write()in_preemptible[#]code:lkp-setup-rootf | 0 | 21 | | BUG:using__this_cpu_write()in_preemptible[#]code:stty | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:tee | 0 | 7 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-rc-loca | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:(exec-inner) | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:groupadd | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:(sd-exec-strv) | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:rc | 0 | 14 | | BUG:using__this_cpu_write()in_preemptible[#]code:getty | 0 | 18 | | BUG:using__this_cpu_write()in_preemptible[#]code:boot-#-debian | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:monitor | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-tmpfile | 0 | 6 | | BUG:using__this_cpu_write()in_preemptible[#]code:lscpu | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:dirname | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-sysuser | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:(d-sysctl) | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:mount | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:ls | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:pgrep | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:grep | 0 | 8 | | BUG:using__this_cpu_write()in_preemptible[#]code:S77lkp-bootstra | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:date | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-sysctl | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:find | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:sshd | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-system | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-sysv-ge | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-hiberna | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:journal-offline | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:sysctl | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:init | 0 | 7 | | BUG:using__this_cpu_write()in_preemptible[#]code:mkdir | 0 | 6 | | BUG:using__this_cpu_write()in_preemptible[#]code:mountpoint | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-logind | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:dmesg | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-ssh-gen | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:cp | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:wakeup | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:dpkg-deb | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:dpkg | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:(modprobe) | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:sync | 0 | 1 | | 
BUG:using__this_cpu_write()in_preemptible[#]code:systemd-update | 0 | 4 | | BUG:using__this_cpu_write()in_preemptible[#]code:kmod | 0 | 1 | | RIP:strncpy_from_user | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:sm-notify | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-remount | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:blkmapd | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:mkfifo | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:ln | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:sh | 0 | 5 | | BUG:using__this_cpu_write()in_preemptible[#]code:bootlogd | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:run-test | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:S07bootlogd | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:hwclock.sh | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:(sd-mkdcreds) | 0 | 1 | | RIP:filldir64 | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:chmod | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:ps | 0 | 3 | | BUG:using__this_cpu_write()in_preemptible[#]code:which | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:ip | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:start-stop-daem | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:S20syslog | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-gpt-aut | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-debug-g | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:(rpcbind) | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:seq | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-run-gen | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:wait | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:addgroup | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:rm | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:in:imklog | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:basename | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:touch | 0 | 1 | | RIP:ia32_setup_frame | 0 | 2 | | BUG:using__this_cpu_write()in_preemptible[#]code:no-stdout-monit | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:systemd-tpm#-ge | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:(mount) | 0 | 1 | | BUG:using__this_cpu_write()in_preemptible[#]code:ldconfig | 0 | 1 | +------------------------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202602181136.f66ba888-lkp@intel.com

[ 0.624787][ T2] ------------[ cut here ]------------
[ 0.625191][ T2] WARNING: mm/vmstat.c:396 at __mod_node_page_state+0x88/0x1c0, CPU#0: kthreadd/2
[ 0.625887][ T2] Modules linked in:
[ 0.626070][ T2] CPU: 0 UID: 0 PID: 2 Comm: kthreadd Tainted: G T 6.19.0-rc6-00596-g4b5f69459c09 #1 PREEMPT(lazy) a55f7fce8adbfb8e52612c1f0ea71f4db1a1df23
[ 0.626084][ T2] Tainted: [T]=RANDSTRUCT
[ 0.626402][ T2] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 0.627150][ T2] RIP: 0010:__mod_node_page_state (mm/vmstat.c:396 (discriminator 34))
[ 0.627592][ T2] Code: 8b 05 88 b9 73 02 48 c7 c7 d8 b0 b4 83 85 c0 89 45 d0 40 0f 95 c6 31 c9 31 d2 40 0f b6 f6 e8 3f 96 e4 ff 8b 45 d0 85 c0 74 1b <0f> 0b be 01 00 00 00 eb 14 31 c9 31 d2 31 f6 48 c7 c7 d8 b0 b4 83
All code
========
   0: 8b 05 88 b9 73 02     mov 0x273b988(%rip),%eax # 0x273b98e
   6: 48 c7 c7 d8 b0 b4 83  mov $0xffffffff83b4b0d8,%rdi
   d: 85 c0                 test %eax,%eax
   f: 89 45 d0              mov %eax,-0x30(%rbp)
  12: 40 0f 95 c6           setne %sil
  16: 31 c9                 xor %ecx,%ecx
  18: 31 d2                 xor %edx,%edx
  1a: 40 0f b6 f6           movzbl %sil,%esi
  1e: e8 3f 96 e4 ff        call 0xffffffffffe49662
  23: 8b 45 d0              mov -0x30(%rbp),%eax
  26: 85 c0                 test %eax,%eax
  28: 74 1b                 je 0x45
  2a:* 0f 0b                ud2 <-- trapping instruction
  2c: be 01 00 00 00        mov $0x1,%esi
  31: eb 14                 jmp 0x47
  33: 31 c9                 xor %ecx,%ecx
  35: 31 d2                 xor %edx,%edx
  37: 31 f6                 xor %esi,%esi
  39: 48 c7 c7 d8 b0 b4 83  mov $0xffffffff83b4b0d8,%rdi

Code starting with the faulting instruction
===========================================
   0: 0f 0b                 ud2
   2: be 01 00 00 00        mov $0x1,%esi
   7: eb 14                 jmp 0x1d
   9: 31 c9                 xor %ecx,%ecx
   b: 31 d2                 xor %edx,%edx
   d: 31 f6                 xor %esi,%esi
   f: 48 c7 c7 d8 b0 b4 83  mov $0xffffffff83b4b0d8,%rdi
[ 0.629418][ T2] RSP: 0000:ffff88810039fa20 EFLAGS: 00010202
[ 0.629869][ T2] RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000000
[ 0.630445][ T2] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 0.631089][ T2] RBP: ffff88810039fa50 R08: 0000000000000000 R09: 0000000000000000
[ 0.631671][ T2] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88883ffe02c0
[ 0.632247][ T2] R13: ffffffff83f18971 R14: ffffffff83f18940 R15: 0000000000000030
[ 0.632746][ T2] FS: 0000000000000000(0000) GS:ffff88889bd1c000(0000) knlGS:0000000000000000
[ 0.633394][ T2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.633875][ T2] CR2: ffff88883ffff000 CR3: 000000000343d000 CR4: 00000000000406b0
[ 0.634478][ T2] Call Trace:
[ 0.634723][ T2]  <TASK>
[ 0.634951][ T2]  alloc_pages_mpol (mm/mempolicy.c:2513 (discriminator 1))
[ 0.635326][ T2]  alloc_frozen_pages_noprof (mm/mempolicy.c:2584)
[ 0.635746][ T2]  allocate_slab (mm/slub.c:3075 (discriminator 2) mm/slub.c:3248 (discriminator 2))
[ 0.636086][ T2]  new_slab (mm/slub.c:3304)
[ 0.636394][ T2]  ___slab_alloc (mm/slub.c:4657)
[ 0.636749][ T2]  ? dup_task_struct (kernel/fork.c:184 (discriminator 2) kernel/fork.c:915 (discriminator 2))
[ 0.637114][ T2]  __slab_alloc+0x8a/0x180
[ 0.637519][ T2]  slab_alloc_node+0x189/0x340
[ 0.637919][ T2]  ? dup_task_struct (kernel/fork.c:184 (discriminator 2) kernel/fork.c:915 (discriminator 2))
[ 0.638285][ T2]  kmem_cache_alloc_node_noprof (mm/slub.c:5317 (discriminator 1))
[ 0.638710][ T2]  dup_task_struct (kernel/fork.c:184 (discriminator 2) kernel/fork.c:915 (discriminator 2))
[ 0.639058][ T2]  ? ftrace_likely_update (arch/x86/include/asm/smap.h:90 kernel/trace/trace_branch.c:223)
[ 0.639416][ T2]  copy_process (kernel/fork.c:2052 (discriminator 1))
[ 0.639773][ T2]  kernel_clone (include/linux/random.h:26 kernel/fork.c:2652)
[ 0.640115][ T2]  ? kthread_fetch_affinity (kernel/kthread.c:412)
[ 0.640552][ T2]  kernel_thread (kernel/fork.c:2713)
[ 0.640892][ T2]  ? kthread_fetch_affinity (kernel/kthread.c:412)
[ 0.641310][ T2]  kthreadd (kernel/kthread.c:486 kernel/kthread.c:844)
[ 0.641621][ T2]  ? kthreadd (kernel/kthread.c:830 (discriminator 5))
[ 0.641938][ T2]  ? kthread_is_per_cpu (kernel/kthread.c:816)
[ 0.642316][ T2]  ret_from_fork (arch/x86/kernel/process.c:164)
[ 0.642657][ T2]  ? kthread_is_per_cpu (kernel/kthread.c:816)
[ 0.642744][ T2]  ? kthread_is_per_cpu (kernel/kthread.c:816)
[ 0.643127][ T2]  ret_from_fork_asm (arch/x86/entry/entry_64.S:256)
[ 0.643502][ T2]  </TASK>
[ 0.643755][ T2] irq event stamp: 393
[ 0.644054][ T2] hardirqs last enabled at (401): __up_console_sem (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/printk/printk.c:345)
[ 0.644730][ T2] hardirqs last disabled at (408): __up_console_sem (kernel/printk/printk.c:343 (discriminator 3))
[ 0.645406][ T2] softirqs last enabled at (54): handle_softirqs (kernel/softirq.c:469 (discriminator 1) kernel/softirq.c:650 (discriminator 1))
[ 0.646077][ T2] softirqs last disabled at (49): __irq_exit_rcu (kernel/softirq.c:657 kernel/softirq.c:496 kernel/softirq.c:723)
[ 0.646741][ T2] ---[ end trace 0000000000000000 ]---

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260218/202602181136.f66ba888-lkp@intel.com

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 27+ messages in thread
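The warning above fires because __mod_node_page_state() is reached from a
preemptible context via alloc_pages_mpol(); as the check at mm/vmstat.c:396
indicates, the double-underscore vmstat helpers expect preemption or
interrupts to already be disabled by the caller. As a minimal sketch only --
this is not a fix proposed in this thread, and the helper name below is made
up for illustration -- the accounting could instead go through the
preemption-safe mod_node_page_state() variant:

	/*
	 * Illustrative sketch: account a successful mempolicy allocation
	 * with the preemption-safe vmstat helper, which is valid from
	 * contexts that may be preempted.
	 */
	static void mpol_stat_add(struct page *page, enum node_stat_item item,
				  unsigned int order)
	{
		if (page)
			mod_node_page_state(page_pgdat(page), item, 1 << order);
	}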
* [PATCH 2/2] mm: move pgscan and pgsteal to node stats 2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn 2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn @ 2026-02-12 4:51 ` JP Kobryn 2026-02-12 7:08 ` Michael S. Tsirkin 2026-02-12 7:29 ` Michal Hocko 2026-02-12 4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox 2026-02-12 18:08 ` [syzbot ci] " syzbot ci 3 siblings, 2 replies; 27+ messages in thread From: JP Kobryn @ 2026-02-12 4:51 UTC (permalink / raw) To: linux-mm Cc: apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team It would be useful to narrow down reclaim to specific nodes. Provide per-node reclaim visibility by changing the pgscan and pgsteal stats from global vm_event_item's to node_stat_item's. Note this change has the side effect of now tracking these stats on a per-memcg basis. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> --- drivers/virtio/virtio_balloon.c | 8 ++++---- include/linux/mmzone.h | 12 +++++++++++ include/linux/vm_event_item.h | 12 ----------- mm/memcontrol.c | 36 ++++++++++++++++++--------------- mm/vmscan.c | 32 +++++++++++------------------ mm/vmstat.c | 24 +++++++++++----------- 6 files changed, 60 insertions(+), 64 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 74fe59f5a78c..1341d9d1a2a1 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -374,13 +374,13 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb) update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN, - pages_to_bytes(events[PGSCAN_KSWAPD])); + pages_to_bytes(global_node_page_state(PGSCAN_KSWAPD))); update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN, - pages_to_bytes(events[PGSCAN_DIRECT])); + pages_to_bytes(global_node_page_state(PGSCAN_DIRECT))); update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM, - pages_to_bytes(events[PGSTEAL_KSWAPD])); + pages_to_bytes(global_node_page_state(PGSTEAL_KSWAPD))); update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM, - pages_to_bytes(events[PGSTEAL_DIRECT])); + pages_to_bytes(global_node_page_state(PGSTEAL_DIRECT))); #ifdef CONFIG_HUGETLB_PAGE update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 762609d5f0af..fc39c107a4b5 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -255,6 +255,18 @@ enum node_stat_item { PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, PGDEMOTE_PROACTIVE, + PGSTEAL_KSWAPD, + PGSTEAL_DIRECT, + PGSTEAL_KHUGEPAGED, + PGSTEAL_PROACTIVE, + PGSTEAL_ANON, + PGSTEAL_FILE, + PGSCAN_KSWAPD, + PGSCAN_DIRECT, + PGSCAN_KHUGEPAGED, + PGSCAN_PROACTIVE, + PGSCAN_ANON, + PGSCAN_FILE, #ifdef CONFIG_NUMA PGALLOC_MPOL_DEFAULT, PGALLOC_MPOL_PREFERRED, diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 92f80b4d69a6..6f1787680658 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -40,19 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, PGLAZYFREED, PGREFILL, PGREUSE, - PGSTEAL_KSWAPD, - 
PGSTEAL_DIRECT, - PGSTEAL_KHUGEPAGED, - PGSTEAL_PROACTIVE, - PGSCAN_KSWAPD, - PGSCAN_DIRECT, - PGSCAN_KHUGEPAGED, - PGSCAN_PROACTIVE, PGSCAN_DIRECT_THROTTLE, - PGSCAN_ANON, - PGSCAN_FILE, - PGSTEAL_ANON, - PGSTEAL_FILE, #ifdef CONFIG_NUMA PGSCAN_ZONE_RECLAIM_SUCCESS, PGSCAN_ZONE_RECLAIM_FAILED, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 86f43b7e5f71..bde0b6536be6 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -328,6 +328,18 @@ static const unsigned int memcg_node_stat_items[] = { PGDEMOTE_DIRECT, PGDEMOTE_KHUGEPAGED, PGDEMOTE_PROACTIVE, + PGSTEAL_KSWAPD, + PGSTEAL_DIRECT, + PGSTEAL_KHUGEPAGED, + PGSTEAL_PROACTIVE, + PGSTEAL_ANON, + PGSTEAL_FILE, + PGSCAN_KSWAPD, + PGSCAN_DIRECT, + PGSCAN_KHUGEPAGED, + PGSCAN_PROACTIVE, + PGSCAN_ANON, + PGSCAN_FILE, #ifdef CONFIG_HUGETLB_PAGE NR_HUGETLB, #endif @@ -441,14 +453,6 @@ static const unsigned int memcg_vm_event_stat[] = { #endif PSWPIN, PSWPOUT, - PGSCAN_KSWAPD, - PGSCAN_DIRECT, - PGSCAN_KHUGEPAGED, - PGSCAN_PROACTIVE, - PGSTEAL_KSWAPD, - PGSTEAL_DIRECT, - PGSTEAL_KHUGEPAGED, - PGSTEAL_PROACTIVE, PGFAULT, PGMAJFAULT, PGREFILL, @@ -1496,15 +1500,15 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) /* Accumulated memory events */ seq_buf_printf(s, "pgscan %lu\n", - memcg_events(memcg, PGSCAN_KSWAPD) + - memcg_events(memcg, PGSCAN_DIRECT) + - memcg_events(memcg, PGSCAN_PROACTIVE) + - memcg_events(memcg, PGSCAN_KHUGEPAGED)); + memcg_page_state(memcg, PGSCAN_KSWAPD) + + memcg_page_state(memcg, PGSCAN_DIRECT) + + memcg_page_state(memcg, PGSCAN_PROACTIVE) + + memcg_page_state(memcg, PGSCAN_KHUGEPAGED)); seq_buf_printf(s, "pgsteal %lu\n", - memcg_events(memcg, PGSTEAL_KSWAPD) + - memcg_events(memcg, PGSTEAL_DIRECT) + - memcg_events(memcg, PGSTEAL_PROACTIVE) + - memcg_events(memcg, PGSTEAL_KHUGEPAGED)); + memcg_page_state(memcg, PGSTEAL_KSWAPD) + + memcg_page_state(memcg, PGSTEAL_DIRECT) + + memcg_page_state(memcg, PGSTEAL_PROACTIVE) + + memcg_page_state(memcg, PGSTEAL_KHUGEPAGED)); for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { #ifdef CONFIG_MEMCG_V1 diff --git a/mm/vmscan.c b/mm/vmscan.c index 614ccf39fe3f..16a0f21e3ea1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1977,7 +1977,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, unsigned long nr_taken; struct reclaim_stat stat; bool file = is_file_lru(lru); - enum vm_event_item item; + enum node_stat_item item; struct pglist_data *pgdat = lruvec_pgdat(lruvec); bool stalled = false; @@ -2003,10 +2003,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); item = PGSCAN_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) - __count_vm_events(item, nr_scanned); - count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); - __count_vm_events(PGSCAN_ANON + file, nr_scanned); + mod_lruvec_state(lruvec, item, nr_scanned); + mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned); spin_unlock_irq(&lruvec->lru_lock); @@ -2023,10 +2021,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, stat.nr_demoted); __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); item = PGSTEAL_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) - __count_vm_events(item, nr_reclaimed); - count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); - __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); + mod_lruvec_state(lruvec, item, nr_reclaimed); + mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); 
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed); @@ -4536,7 +4532,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, { int i; int gen; - enum vm_event_item item; + enum node_stat_item item; int sorted = 0; int scanned = 0; int isolated = 0; @@ -4595,13 +4591,11 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, } item = PGSCAN_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) { - __count_vm_events(item, isolated); + if (!cgroup_reclaim(sc)) __count_vm_events(PGREFILL, sorted); - } - count_memcg_events(memcg, item, isolated); + mod_lruvec_state(lruvec, item, isolated); count_memcg_events(memcg, PGREFILL, sorted); - __count_vm_events(PGSCAN_ANON + type, isolated); + mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated); trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch, scanned, skipped, isolated, type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); @@ -4686,7 +4680,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, LIST_HEAD(clean); struct folio *folio; struct folio *next; - enum vm_event_item item; + enum node_stat_item item; struct reclaim_stat stat; struct lru_gen_mm_walk *walk; bool skip_retry = false; @@ -4750,10 +4744,8 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, stat.nr_demoted); item = PGSTEAL_KSWAPD + reclaimer_offset(sc); - if (!cgroup_reclaim(sc)) - __count_vm_events(item, reclaimed); - count_memcg_events(memcg, item, reclaimed); - __count_vm_events(PGSTEAL_ANON + type, reclaimed); + mod_lruvec_state(lruvec, item, reclaimed); + mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); spin_unlock_irq(&lruvec->lru_lock); diff --git a/mm/vmstat.c b/mm/vmstat.c index 74e0ddde1e93..e4b259989d58 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1291,6 +1291,18 @@ const char * const vmstat_text[] = { [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", + [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", + [I(PGSTEAL_DIRECT)] = "pgsteal_direct", + [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", + [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", + [I(PGSTEAL_ANON)] = "pgsteal_anon", + [I(PGSTEAL_FILE)] = "pgsteal_file", + [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", + [I(PGSCAN_DIRECT)] = "pgscan_direct", + [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", + [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", + [I(PGSCAN_ANON)] = "pgscan_anon", + [I(PGSCAN_FILE)] = "pgscan_file", #ifdef CONFIG_NUMA [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", @@ -1344,19 +1356,7 @@ const char * const vmstat_text[] = { [I(PGREFILL)] = "pgrefill", [I(PGREUSE)] = "pgreuse", - [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", - [I(PGSTEAL_DIRECT)] = "pgsteal_direct", - [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", - [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", - [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", - [I(PGSCAN_DIRECT)] = "pgscan_direct", - [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", - [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", [I(PGSCAN_DIRECT_THROTTLE)] = "pgscan_direct_throttle", - [I(PGSCAN_ANON)] = "pgscan_anon", - [I(PGSCAN_FILE)] = "pgscan_file", - [I(PGSTEAL_ANON)] = "pgsteal_anon", - [I(PGSTEAL_FILE)] = "pgsteal_file", #ifdef CONFIG_NUMA [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success", -- 2.47.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
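The accounting in this patch keeps the existing pattern of computing the
counter as PGSCAN_KSWAPD (or PGSTEAL_KSWAPD) plus reclaimer_offset(sc), which
only holds if the kswapd/direct/khugepaged/proactive entries stay contiguous
and in that relative order in the new node_stat_item enum. A hedged sketch of
a compile-time guard documenting that layout is below; the function name is
invented for illustration and no such check is part of the posted patch:

	/*
	 * Illustrative only: document the enum-layout assumption behind
	 * the "PGSTEAL_KSWAPD + reclaimer_offset(sc)" arithmetic.
	 */
	static inline void check_reclaim_stat_layout(void)
	{
		BUILD_BUG_ON(PGSTEAL_DIRECT     != PGSTEAL_KSWAPD + 1);
		BUILD_BUG_ON(PGSTEAL_KHUGEPAGED != PGSTEAL_KSWAPD + 2);
		BUILD_BUG_ON(PGSTEAL_PROACTIVE  != PGSTEAL_KSWAPD + 3);
		BUILD_BUG_ON(PGSCAN_DIRECT      != PGSCAN_KSWAPD + 1);
		BUILD_BUG_ON(PGSCAN_KHUGEPAGED  != PGSCAN_KSWAPD + 2);
		BUILD_BUG_ON(PGSCAN_PROACTIVE   != PGSCAN_KSWAPD + 3);
	}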
* Re: [PATCH 2/2] mm: move pgscan and pgsteal to node stats 2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn @ 2026-02-12 7:08 ` Michael S. Tsirkin 2026-02-12 21:23 ` JP Kobryn 2026-02-12 7:29 ` Michal Hocko 1 sibling, 1 reply; 27+ messages in thread From: Michael S. Tsirkin @ 2026-02-12 7:08 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mhocko, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Wed, Feb 11, 2026 at 08:51:09PM -0800, JP Kobryn wrote: > It would be useful to narrow down reclaim to specific nodes. > > Provide per-node reclaim visibility by changing the pgscan and pgsteal > stats from global vm_event_item's to node_stat_item's. Note this change has > the side effect of now tracking these stats on a per-memcg basis. > > Signed-off-by: JP Kobryn <inwardvessel@gmail.com> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> virtio_balloon changes > --- > drivers/virtio/virtio_balloon.c | 8 ++++---- > include/linux/mmzone.h | 12 +++++++++++ > include/linux/vm_event_item.h | 12 ----------- > mm/memcontrol.c | 36 ++++++++++++++++++--------------- > mm/vmscan.c | 32 +++++++++++------------------ > mm/vmstat.c | 24 +++++++++++----------- > 6 files changed, 60 insertions(+), 64 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 74fe59f5a78c..1341d9d1a2a1 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -374,13 +374,13 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb) > update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); > > update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN, > - pages_to_bytes(events[PGSCAN_KSWAPD])); > + pages_to_bytes(global_node_page_state(PGSCAN_KSWAPD))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN, > - pages_to_bytes(events[PGSCAN_DIRECT])); > + pages_to_bytes(global_node_page_state(PGSCAN_DIRECT))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM, > - pages_to_bytes(events[PGSTEAL_KSWAPD])); > + pages_to_bytes(global_node_page_state(PGSTEAL_KSWAPD))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM, > - pages_to_bytes(events[PGSTEAL_DIRECT])); > + pages_to_bytes(global_node_page_state(PGSTEAL_DIRECT))); > > #ifdef CONFIG_HUGETLB_PAGE > update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 762609d5f0af..fc39c107a4b5 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -255,6 +255,18 @@ enum node_stat_item { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > + PGSTEAL_KSWAPD, > + PGSTEAL_DIRECT, > + PGSTEAL_KHUGEPAGED, > + PGSTEAL_PROACTIVE, > + PGSTEAL_ANON, > + PGSTEAL_FILE, > + PGSCAN_KSWAPD, > + PGSCAN_DIRECT, > + PGSCAN_KHUGEPAGED, > + PGSCAN_PROACTIVE, > + PGSCAN_ANON, > + PGSCAN_FILE, > #ifdef CONFIG_NUMA > PGALLOC_MPOL_DEFAULT, > PGALLOC_MPOL_PREFERRED, > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 92f80b4d69a6..6f1787680658 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -40,19 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > PGLAZYFREED, > PGREFILL, > PGREUSE, > - PGSTEAL_KSWAPD, > 
- PGSTEAL_DIRECT, > - PGSTEAL_KHUGEPAGED, > - PGSTEAL_PROACTIVE, > - PGSCAN_KSWAPD, > - PGSCAN_DIRECT, > - PGSCAN_KHUGEPAGED, > - PGSCAN_PROACTIVE, > PGSCAN_DIRECT_THROTTLE, > - PGSCAN_ANON, > - PGSCAN_FILE, > - PGSTEAL_ANON, > - PGSTEAL_FILE, > #ifdef CONFIG_NUMA > PGSCAN_ZONE_RECLAIM_SUCCESS, > PGSCAN_ZONE_RECLAIM_FAILED, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 86f43b7e5f71..bde0b6536be6 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -328,6 +328,18 @@ static const unsigned int memcg_node_stat_items[] = { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > + PGSTEAL_KSWAPD, > + PGSTEAL_DIRECT, > + PGSTEAL_KHUGEPAGED, > + PGSTEAL_PROACTIVE, > + PGSTEAL_ANON, > + PGSTEAL_FILE, > + PGSCAN_KSWAPD, > + PGSCAN_DIRECT, > + PGSCAN_KHUGEPAGED, > + PGSCAN_PROACTIVE, > + PGSCAN_ANON, > + PGSCAN_FILE, > #ifdef CONFIG_HUGETLB_PAGE > NR_HUGETLB, > #endif > @@ -441,14 +453,6 @@ static const unsigned int memcg_vm_event_stat[] = { > #endif > PSWPIN, > PSWPOUT, > - PGSCAN_KSWAPD, > - PGSCAN_DIRECT, > - PGSCAN_KHUGEPAGED, > - PGSCAN_PROACTIVE, > - PGSTEAL_KSWAPD, > - PGSTEAL_DIRECT, > - PGSTEAL_KHUGEPAGED, > - PGSTEAL_PROACTIVE, > PGFAULT, > PGMAJFAULT, > PGREFILL, > @@ -1496,15 +1500,15 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) > > /* Accumulated memory events */ > seq_buf_printf(s, "pgscan %lu\n", > - memcg_events(memcg, PGSCAN_KSWAPD) + > - memcg_events(memcg, PGSCAN_DIRECT) + > - memcg_events(memcg, PGSCAN_PROACTIVE) + > - memcg_events(memcg, PGSCAN_KHUGEPAGED)); > + memcg_page_state(memcg, PGSCAN_KSWAPD) + > + memcg_page_state(memcg, PGSCAN_DIRECT) + > + memcg_page_state(memcg, PGSCAN_PROACTIVE) + > + memcg_page_state(memcg, PGSCAN_KHUGEPAGED)); > seq_buf_printf(s, "pgsteal %lu\n", > - memcg_events(memcg, PGSTEAL_KSWAPD) + > - memcg_events(memcg, PGSTEAL_DIRECT) + > - memcg_events(memcg, PGSTEAL_PROACTIVE) + > - memcg_events(memcg, PGSTEAL_KHUGEPAGED)); > + memcg_page_state(memcg, PGSTEAL_KSWAPD) + > + memcg_page_state(memcg, PGSTEAL_DIRECT) + > + memcg_page_state(memcg, PGSTEAL_PROACTIVE) + > + memcg_page_state(memcg, PGSTEAL_KHUGEPAGED)); > > for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { > #ifdef CONFIG_MEMCG_V1 > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 614ccf39fe3f..16a0f21e3ea1 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1977,7 +1977,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > unsigned long nr_taken; > struct reclaim_stat stat; > bool file = is_file_lru(lru); > - enum vm_event_item item; > + enum node_stat_item item; > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > bool stalled = false; > > @@ -2003,10 +2003,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); > item = PGSCAN_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, nr_scanned); > - count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); > - __count_vm_events(PGSCAN_ANON + file, nr_scanned); > + mod_lruvec_state(lruvec, item, nr_scanned); > + mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned); > > spin_unlock_irq(&lruvec->lru_lock); > > @@ -2023,10 +2021,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > stat.nr_demoted); > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); > item = PGSTEAL_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, nr_reclaimed); > - 
count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); > - __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); > + mod_lruvec_state(lruvec, item, nr_reclaimed); > + mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); > > lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, > nr_scanned - nr_reclaimed); > @@ -4536,7 +4532,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > { > int i; > int gen; > - enum vm_event_item item; > + enum node_stat_item item; > int sorted = 0; > int scanned = 0; > int isolated = 0; > @@ -4595,13 +4591,11 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > } > > item = PGSCAN_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) { > - __count_vm_events(item, isolated); > + if (!cgroup_reclaim(sc)) > __count_vm_events(PGREFILL, sorted); > - } > - count_memcg_events(memcg, item, isolated); > + mod_lruvec_state(lruvec, item, isolated); > count_memcg_events(memcg, PGREFILL, sorted); > - __count_vm_events(PGSCAN_ANON + type, isolated); > + mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated); > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch, > scanned, skipped, isolated, > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); > @@ -4686,7 +4680,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > LIST_HEAD(clean); > struct folio *folio; > struct folio *next; > - enum vm_event_item item; > + enum node_stat_item item; > struct reclaim_stat stat; > struct lru_gen_mm_walk *walk; > bool skip_retry = false; > @@ -4750,10 +4744,8 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > stat.nr_demoted); > > item = PGSTEAL_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, reclaimed); > - count_memcg_events(memcg, item, reclaimed); > - __count_vm_events(PGSTEAL_ANON + type, reclaimed); > + mod_lruvec_state(lruvec, item, reclaimed); > + mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); > > spin_unlock_irq(&lruvec->lru_lock); > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 74e0ddde1e93..e4b259989d58 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1291,6 +1291,18 @@ const char * const vmstat_text[] = { > [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", > [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", > [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", > + [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", > + [I(PGSTEAL_DIRECT)] = "pgsteal_direct", > + [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", > + [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", > + [I(PGSTEAL_ANON)] = "pgsteal_anon", > + [I(PGSTEAL_FILE)] = "pgsteal_file", > + [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", > + [I(PGSCAN_DIRECT)] = "pgscan_direct", > + [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", > + [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", > + [I(PGSCAN_ANON)] = "pgscan_anon", > + [I(PGSCAN_FILE)] = "pgscan_file", > #ifdef CONFIG_NUMA > [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", > [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", > @@ -1344,19 +1356,7 @@ const char * const vmstat_text[] = { > > [I(PGREFILL)] = "pgrefill", > [I(PGREUSE)] = "pgreuse", > - [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", > - [I(PGSTEAL_DIRECT)] = "pgsteal_direct", > - [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", > - [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", > - [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", > - [I(PGSCAN_DIRECT)] = "pgscan_direct", > - [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", > - [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", > 
[I(PGSCAN_DIRECT_THROTTLE)] = "pgscan_direct_throttle", > - [I(PGSCAN_ANON)] = "pgscan_anon", > - [I(PGSCAN_FILE)] = "pgscan_file", > - [I(PGSTEAL_ANON)] = "pgsteal_anon", > - [I(PGSTEAL_FILE)] = "pgsteal_file", > > #ifdef CONFIG_NUMA > [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success", > -- > 2.47.3 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/2] mm: move pgscan and pgsteal to node stats
  2026-02-12 7:08 ` Michael S. Tsirkin
@ 2026-02-12 21:23 ` JP Kobryn
  0 siblings, 0 replies; 27+ messages in thread
From: JP Kobryn @ 2026-02-12 21:23 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mhocko, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On 2/11/26 11:08 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 11, 2026 at 08:51:09PM -0800, JP Kobryn wrote:
>> It would be useful to narrow down reclaim to specific nodes.
>>
>> Provide per-node reclaim visibility by changing the pgscan and pgsteal
>> stats from global vm_event_item's to node_stat_item's. Note this change has
>> the side effect of now tracking these stats on a per-memcg basis.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
>
> virtio_balloon changes

Thanks. I'll make sure to spell this out in the v2 changelog.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 2/2] mm: move pgscan and pgsteal to node stats 2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn 2026-02-12 7:08 ` Michael S. Tsirkin @ 2026-02-12 7:29 ` Michal Hocko 2026-02-12 21:20 ` JP Kobryn 1 sibling, 1 reply; 27+ messages in thread From: Michal Hocko @ 2026-02-12 7:29 UTC (permalink / raw) To: JP Kobryn Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david, eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett, linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt, muchun.song, zhengqi.arch, rakie.kim, roman.gushchin, shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo, ying.huang, yuanchu, ziy, kernel-team On Wed 11-02-26 20:51:09, JP Kobryn wrote: > It would be useful to narrow down reclaim to specific nodes. > > Provide per-node reclaim visibility by changing the pgscan and pgsteal > stats from global vm_event_item's to node_stat_item's. Note this change has > the side effect of now tracking these stats on a per-memcg basis. The changelog could have been more clear about the actual changes as this is not overly clear for untrained eyes. The most important parts are that /proc/vmstat will preserve reclaim stats with slightly different counters ordering (shouldn't break userspace much^W), per-node stats will be now newly displayed in /proc/zoneinfo - this is presumably the primary motivation to have a better insight of per-node reclaim activity, and memcg stats will now show their share of the global memory reclaim. Have I missed anything? > Signed-off-by: JP Kobryn <inwardvessel@gmail.com> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > --- > drivers/virtio/virtio_balloon.c | 8 ++++---- > include/linux/mmzone.h | 12 +++++++++++ > include/linux/vm_event_item.h | 12 ----------- > mm/memcontrol.c | 36 ++++++++++++++++++--------------- > mm/vmscan.c | 32 +++++++++++------------------ > mm/vmstat.c | 24 +++++++++++----------- > 6 files changed, 60 insertions(+), 64 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 74fe59f5a78c..1341d9d1a2a1 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -374,13 +374,13 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb) > update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); > > update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN, > - pages_to_bytes(events[PGSCAN_KSWAPD])); > + pages_to_bytes(global_node_page_state(PGSCAN_KSWAPD))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN, > - pages_to_bytes(events[PGSCAN_DIRECT])); > + pages_to_bytes(global_node_page_state(PGSCAN_DIRECT))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM, > - pages_to_bytes(events[PGSTEAL_KSWAPD])); > + pages_to_bytes(global_node_page_state(PGSTEAL_KSWAPD))); > update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM, > - pages_to_bytes(events[PGSTEAL_DIRECT])); > + pages_to_bytes(global_node_page_state(PGSTEAL_DIRECT))); > > #ifdef CONFIG_HUGETLB_PAGE > update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 762609d5f0af..fc39c107a4b5 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -255,6 +255,18 @@ enum node_stat_item { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > + PGSTEAL_KSWAPD, > + PGSTEAL_DIRECT, > + PGSTEAL_KHUGEPAGED, > + PGSTEAL_PROACTIVE, > + PGSTEAL_ANON, > + PGSTEAL_FILE, > + PGSCAN_KSWAPD, > + PGSCAN_DIRECT, > + 
PGSCAN_KHUGEPAGED, > + PGSCAN_PROACTIVE, > + PGSCAN_ANON, > + PGSCAN_FILE, > #ifdef CONFIG_NUMA > PGALLOC_MPOL_DEFAULT, > PGALLOC_MPOL_PREFERRED, > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 92f80b4d69a6..6f1787680658 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -40,19 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > PGLAZYFREED, > PGREFILL, > PGREUSE, > - PGSTEAL_KSWAPD, > - PGSTEAL_DIRECT, > - PGSTEAL_KHUGEPAGED, > - PGSTEAL_PROACTIVE, > - PGSCAN_KSWAPD, > - PGSCAN_DIRECT, > - PGSCAN_KHUGEPAGED, > - PGSCAN_PROACTIVE, > PGSCAN_DIRECT_THROTTLE, > - PGSCAN_ANON, > - PGSCAN_FILE, > - PGSTEAL_ANON, > - PGSTEAL_FILE, > #ifdef CONFIG_NUMA > PGSCAN_ZONE_RECLAIM_SUCCESS, > PGSCAN_ZONE_RECLAIM_FAILED, > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 86f43b7e5f71..bde0b6536be6 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -328,6 +328,18 @@ static const unsigned int memcg_node_stat_items[] = { > PGDEMOTE_DIRECT, > PGDEMOTE_KHUGEPAGED, > PGDEMOTE_PROACTIVE, > + PGSTEAL_KSWAPD, > + PGSTEAL_DIRECT, > + PGSTEAL_KHUGEPAGED, > + PGSTEAL_PROACTIVE, > + PGSTEAL_ANON, > + PGSTEAL_FILE, > + PGSCAN_KSWAPD, > + PGSCAN_DIRECT, > + PGSCAN_KHUGEPAGED, > + PGSCAN_PROACTIVE, > + PGSCAN_ANON, > + PGSCAN_FILE, > #ifdef CONFIG_HUGETLB_PAGE > NR_HUGETLB, > #endif > @@ -441,14 +453,6 @@ static const unsigned int memcg_vm_event_stat[] = { > #endif > PSWPIN, > PSWPOUT, > - PGSCAN_KSWAPD, > - PGSCAN_DIRECT, > - PGSCAN_KHUGEPAGED, > - PGSCAN_PROACTIVE, > - PGSTEAL_KSWAPD, > - PGSTEAL_DIRECT, > - PGSTEAL_KHUGEPAGED, > - PGSTEAL_PROACTIVE, > PGFAULT, > PGMAJFAULT, > PGREFILL, > @@ -1496,15 +1500,15 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) > > /* Accumulated memory events */ > seq_buf_printf(s, "pgscan %lu\n", > - memcg_events(memcg, PGSCAN_KSWAPD) + > - memcg_events(memcg, PGSCAN_DIRECT) + > - memcg_events(memcg, PGSCAN_PROACTIVE) + > - memcg_events(memcg, PGSCAN_KHUGEPAGED)); > + memcg_page_state(memcg, PGSCAN_KSWAPD) + > + memcg_page_state(memcg, PGSCAN_DIRECT) + > + memcg_page_state(memcg, PGSCAN_PROACTIVE) + > + memcg_page_state(memcg, PGSCAN_KHUGEPAGED)); > seq_buf_printf(s, "pgsteal %lu\n", > - memcg_events(memcg, PGSTEAL_KSWAPD) + > - memcg_events(memcg, PGSTEAL_DIRECT) + > - memcg_events(memcg, PGSTEAL_PROACTIVE) + > - memcg_events(memcg, PGSTEAL_KHUGEPAGED)); > + memcg_page_state(memcg, PGSTEAL_KSWAPD) + > + memcg_page_state(memcg, PGSTEAL_DIRECT) + > + memcg_page_state(memcg, PGSTEAL_PROACTIVE) + > + memcg_page_state(memcg, PGSTEAL_KHUGEPAGED)); > > for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); i++) { > #ifdef CONFIG_MEMCG_V1 > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 614ccf39fe3f..16a0f21e3ea1 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1977,7 +1977,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > unsigned long nr_taken; > struct reclaim_stat stat; > bool file = is_file_lru(lru); > - enum vm_event_item item; > + enum node_stat_item item; > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > bool stalled = false; > > @@ -2003,10 +2003,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken); > item = PGSCAN_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, nr_scanned); > - count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned); > - __count_vm_events(PGSCAN_ANON + file, nr_scanned); > + 
mod_lruvec_state(lruvec, item, nr_scanned); > + mod_lruvec_state(lruvec, PGSCAN_ANON + file, nr_scanned); > > spin_unlock_irq(&lruvec->lru_lock); > > @@ -2023,10 +2021,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan, > stat.nr_demoted); > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken); > item = PGSTEAL_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, nr_reclaimed); > - count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed); > - __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed); > + mod_lruvec_state(lruvec, item, nr_reclaimed); > + mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed); > > lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout, > nr_scanned - nr_reclaimed); > @@ -4536,7 +4532,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > { > int i; > int gen; > - enum vm_event_item item; > + enum node_stat_item item; > int sorted = 0; > int scanned = 0; > int isolated = 0; > @@ -4595,13 +4591,11 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > } > > item = PGSCAN_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) { > - __count_vm_events(item, isolated); > + if (!cgroup_reclaim(sc)) > __count_vm_events(PGREFILL, sorted); > - } > - count_memcg_events(memcg, item, isolated); > + mod_lruvec_state(lruvec, item, isolated); > count_memcg_events(memcg, PGREFILL, sorted); > - __count_vm_events(PGSCAN_ANON + type, isolated); > + mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated); > trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch, > scanned, skipped, isolated, > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON); > @@ -4686,7 +4680,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > LIST_HEAD(clean); > struct folio *folio; > struct folio *next; > - enum vm_event_item item; > + enum node_stat_item item; > struct reclaim_stat stat; > struct lru_gen_mm_walk *walk; > bool skip_retry = false; > @@ -4750,10 +4744,8 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec, > stat.nr_demoted); > > item = PGSTEAL_KSWAPD + reclaimer_offset(sc); > - if (!cgroup_reclaim(sc)) > - __count_vm_events(item, reclaimed); > - count_memcg_events(memcg, item, reclaimed); > - __count_vm_events(PGSTEAL_ANON + type, reclaimed); > + mod_lruvec_state(lruvec, item, reclaimed); > + mod_lruvec_state(lruvec, PGSTEAL_ANON + type, reclaimed); > > spin_unlock_irq(&lruvec->lru_lock); > > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 74e0ddde1e93..e4b259989d58 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -1291,6 +1291,18 @@ const char * const vmstat_text[] = { > [I(PGDEMOTE_DIRECT)] = "pgdemote_direct", > [I(PGDEMOTE_KHUGEPAGED)] = "pgdemote_khugepaged", > [I(PGDEMOTE_PROACTIVE)] = "pgdemote_proactive", > + [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", > + [I(PGSTEAL_DIRECT)] = "pgsteal_direct", > + [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", > + [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", > + [I(PGSTEAL_ANON)] = "pgsteal_anon", > + [I(PGSTEAL_FILE)] = "pgsteal_file", > + [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", > + [I(PGSCAN_DIRECT)] = "pgscan_direct", > + [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", > + [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", > + [I(PGSCAN_ANON)] = "pgscan_anon", > + [I(PGSCAN_FILE)] = "pgscan_file", > #ifdef CONFIG_NUMA > [I(PGALLOC_MPOL_DEFAULT)] = "pgalloc_mpol_default", > [I(PGALLOC_MPOL_PREFERRED)] = "pgalloc_mpol_preferred", > @@ -1344,19 +1356,7 @@ const char * const 
vmstat_text[] = { > > [I(PGREFILL)] = "pgrefill", > [I(PGREUSE)] = "pgreuse", > - [I(PGSTEAL_KSWAPD)] = "pgsteal_kswapd", > - [I(PGSTEAL_DIRECT)] = "pgsteal_direct", > - [I(PGSTEAL_KHUGEPAGED)] = "pgsteal_khugepaged", > - [I(PGSTEAL_PROACTIVE)] = "pgsteal_proactive", > - [I(PGSCAN_KSWAPD)] = "pgscan_kswapd", > - [I(PGSCAN_DIRECT)] = "pgscan_direct", > - [I(PGSCAN_KHUGEPAGED)] = "pgscan_khugepaged", > - [I(PGSCAN_PROACTIVE)] = "pgscan_proactive", > [I(PGSCAN_DIRECT_THROTTLE)] = "pgscan_direct_throttle", > - [I(PGSCAN_ANON)] = "pgscan_anon", > - [I(PGSCAN_FILE)] = "pgscan_file", > - [I(PGSTEAL_ANON)] = "pgsteal_anon", > - [I(PGSTEAL_FILE)] = "pgsteal_file", > > #ifdef CONFIG_NUMA > [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success", > -- > 2.47.3 > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/2] mm: move pgscan and pgsteal to node stats
  2026-02-12 7:29 ` Michal Hocko
@ 2026-02-12 21:20 ` JP Kobryn
  0 siblings, 0 replies; 27+ messages in thread
From: JP Kobryn @ 2026-02-12 21:20 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mst, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On 2/11/26 11:29 PM, Michal Hocko wrote:
> On Wed 11-02-26 20:51:09, JP Kobryn wrote:
>> It would be useful to narrow down reclaim to specific nodes.
>>
>> Provide per-node reclaim visibility by changing the pgscan and pgsteal
>> stats from global vm_event_item's to node_stat_item's. Note this change has
>> the side effect of now tracking these stats on a per-memcg basis.
>
> The changelog could have been more clear about the actual changes as
> this is not overly clear for untrained eyes. The most important parts
> are that /proc/vmstat will preserve reclaim stats with slightly
> different counters ordering (shouldn't break userspace much^W), per-node
> stats will be now newly displayed in /proc/zoneinfo - this is presumably
> the primary motivation to have a better insight of per-node reclaim
> activity, and memcg stats will now show their share of the global memory
> reclaim.
>
> Have I missed anything?

That's accurate. Plus aside from reading /proc/zoneinfo they will also be
in /sys/devices/system/node/nodeN/vmstat. I see I could have been more
explicit about this. Let me make additions to the changelog in v2.

Thanks for taking a look.

^ permalink raw reply	[flat|nested] 27+ messages in thread
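As a concrete illustration of the visibility being discussed above (assuming
the series is applied and using the counter names the patch adds to
vmstat_text), a trivial userspace reader for the per-node file mentioned in
the reply might look like the sketch below; the node0 path is hard-coded
purely for the example:

	/*
	 * Sketch: print the per-node pgscan_* and pgsteal_* counters that
	 * this series exposes in /sys/devices/system/node/nodeN/vmstat.
	 */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/sys/devices/system/node/node0/vmstat", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "pgscan_", 7) ||
			    !strncmp(line, "pgsteal_", 8))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}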
* Re: [PATCH 0/2] improve per-node allocation and reclaim visibility
  2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn
  2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn
  2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn
@ 2026-02-12 4:57 ` Matthew Wilcox
  2026-02-12 21:22 ` JP Kobryn
  0 siblings, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2026-02-12 4:57 UTC (permalink / raw)
To: JP Kobryn
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On Wed, Feb 11, 2026 at 08:51:07PM -0800, JP Kobryn wrote:
> We sometimes find ourselves in situations where reclaim kicks in, yet there

who is we? you haven't indicated any affiliation in your tags.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/2] improve per-node allocation and reclaim visibility
  2026-02-12 4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox
@ 2026-02-12 21:22 ` JP Kobryn
  2026-02-12 21:53 ` Matthew Wilcox
  0 siblings, 1 reply; 27+ messages in thread
From: JP Kobryn @ 2026-02-12 21:22 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On 2/11/26 8:57 PM, Matthew Wilcox wrote:
> On Wed, Feb 11, 2026 at 08:51:07PM -0800, JP Kobryn wrote:
>> We sometimes find ourselves in situations where reclaim kicks in, yet there
>
> who is we? you haven't indicated any affiliation in your tags.

Meta. Is there a preferred way of indicating this?

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/2] improve per-node allocation and reclaim visibility
  2026-02-12 21:22 ` JP Kobryn
@ 2026-02-12 21:53 ` Matthew Wilcox
  0 siblings, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2026-02-12 21:53 UTC (permalink / raw)
To: JP Kobryn
Cc: linux-mm, apopple, akpm, axelrasmussen, byungchul, cgroups, david,
	eperezma, gourry, jasowang, hannes, joshua.hahnjy, Liam.Howlett,
	linux-kernel, lorenzo.stoakes, matthew.brost, mst, mhocko, rppt,
	muchun.song, zhengqi.arch, rakie.kim, roman.gushchin,
	shakeel.butt, surenb, virtualization, vbabka, weixugc, xuanzhuo,
	ying.huang, yuanchu, ziy, kernel-team

On Thu, Feb 12, 2026 at 01:22:09PM -0800, JP Kobryn wrote:
> On 2/11/26 8:57 PM, Matthew Wilcox wrote:
> > On Wed, Feb 11, 2026 at 08:51:07PM -0800, JP Kobryn wrote:
> > > We sometimes find ourselves in situations where reclaim kicks in, yet there
> >
> > who is we? you haven't indicated any affiliation in your tags.
>
> Meta. Is there a preferred way of indicating this?

Documentation/process/submitting-patches.rst:

From Line
^^^^^^^^^

The ``from`` line must be the very first line in the message body, and has
the form:

	From: Patch Author <author@example.com>

The ``from`` line specifies who will be credited as the author of the
patch in the permanent changelog. If the ``from`` line is missing, then
the ``From:`` line from the email header will be used to determine the
patch author in the changelog.

The author may indicate their affiliation or the sponsor of the work by
adding the name of an organization to the ``from`` and ``SoB`` lines,
e.g.:

	From: Patch Author (Company) <author@example.com>

I do this with ~/.gitconfig

[user]
	name = Matthew Wilcox (Oracle)
	email = willy@infradead.org

and it goes into the From and Signed-off-by lines correctly when
generating patches.

^ permalink raw reply	[flat|nested] 27+ messages in thread
* [syzbot ci] Re: improve per-node allocation and reclaim visibility
  2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn
  ` (2 preceding siblings ...)
  2026-02-12 4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox
@ 2026-02-12 18:08 ` syzbot ci
  3 siblings, 0 replies; 27+ messages in thread
From: syzbot ci @ 2026-02-12 18:08 UTC (permalink / raw)
To: akpm, apopple, axelrasmussen, byungchul, cgroups, david, eperezma,
	gourry, hannes, inwardvessel, jasowang, joshua.hahnjy,
	kernel-team, liam.howlett, linux-kernel, linux-mm,
	lorenzo.stoakes, matthew.brost, mhocko, mst, muchun.song,
	rakie.kim, roman.gushchin, rppt, shakeel.butt, surenb, vbabka,
	virtualization, weixugc, xuanzhuo, ying.huang, yuanchu,
	zhengqi.arch, ziy
Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] improve per-node allocation and reclaim visibility
https://lore.kernel.org/all/20260212045109.255391-1-inwardvessel@gmail.com
* [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy
* [PATCH 2/2] mm: move pgscan and pgsteal to node stats

and found the following issue:
WARNING in __mod_node_page_state

Full report is available here:
https://ci.syzbot.org/series/4ec12ede-3298-43a3-ab6b-79d47759672e

***

WARNING in __mod_node_page_state

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      72a46cdd4ef13690beb8c5a2f6a2023fd7ef2eb4
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/0f678e4c-a4ba-4f17-8ed7-8ae99e56a463/config

------------[ cut here ]------------
IS_ENABLED(CONFIG_PREEMPT_COUNT) && __lockdep_enabled && (preempt_count() == 0 && this_cpu_read(hardirqs_enabled))
WARNING: mm/vmstat.c:396 at __mod_node_page_state+0x126/0x170, CPU#0: kthreadd/2
Modules linked in:
CPU: 0 UID: 0 PID: 2 Comm: kthreadd Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__mod_node_page_state+0x126/0x170
Code: 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 48 89 df 4c 89 e6 44 89 fa e8 68 00 00 00 31 db eb cc 90 0f 0b 90 e9 3e ff ff ff 90 <0f> 0b 90 eb 80 48 c7 c7 e0 c6 64 8e 4c 89 f6 e8 66 3c d3 02 e9 28
RSP: 0000:ffffc900000773d0 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 000000000000003d RDI: ffff88815fffb380
RBP: dffffc0000000000 R08: ffffffff8fef2977 R09: 1ffffffff1fde52e
R10: dffffc0000000000 R11: fffffbfff1fde52f R12: ffff88815fffb380
R13: ffffffff92f50f00 R14: 000000000000003d R15: 000000000000003d
FS:  0000000000000000(0000) GS:ffff88818e0f0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e346000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 alloc_pages_mpol+0x407/0x740
 alloc_pages_noprof+0xa8/0x190
 get_free_pages_noprof+0xf/0x80
 __kasan_populate_vmalloc+0x38/0x1d0
 alloc_vmap_area+0xd21/0x1460
 __get_vm_area_node+0x1f8/0x300
 __vmalloc_node_range_noprof+0x372/0x1730
 __vmalloc_node_noprof+0xc2/0x100
 dup_task_struct+0x228/0x9a0
 copy_process+0x508/0x3980
 kernel_clone+0x248/0x870
 kernel_thread+0x13f/0x1b0
 kthreadd+0x4f9/0x6f0
 ret_from_fork+0x51b/0xa40
 ret_from_fork_asm+0x1a/0x30
 </TASK>

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

^ permalink raw reply	[flat|nested] 27+ messages in thread
end of thread, other threads:[~2026-02-18 4:25 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-12  4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn
2026-02-12  4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn
2026-02-12  7:29   ` Michal Hocko
2026-02-12 21:22     ` JP Kobryn
2026-02-16  8:26       ` Michal Hocko
2026-02-16 17:50         ` JP Kobryn (Meta)
2026-02-16 21:07           ` Michal Hocko
2026-02-17  7:48             ` JP Kobryn (Meta)
2026-02-17 12:37               ` Michal Hocko
2026-02-17 18:19                 ` JP Kobryn (Meta)
2026-02-17 18:52                   ` Michal Hocko
2026-02-12 15:07   ` Shakeel Butt
2026-02-12 21:23     ` JP Kobryn
2026-02-12 15:24   ` Vlastimil Babka
2026-02-12 21:25     ` JP Kobryn
2026-02-13  8:54       ` Vlastimil Babka
2026-02-13 19:56         ` JP Kobryn (Meta)
2026-02-18  4:25   ` kernel test robot
2026-02-12  4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn
2026-02-12  7:08   ` Michael S. Tsirkin
2026-02-12 21:23     ` JP Kobryn
2026-02-12  7:29   ` Michal Hocko
2026-02-12 21:20     ` JP Kobryn
2026-02-12  4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox
2026-02-12 21:22   ` JP Kobryn
2026-02-12 21:53     ` Matthew Wilcox
2026-02-12 18:08 ` [syzbot ci] " syzbot ci