From: "JP Kobryn (Meta)" <inwardvessel@gmail.com>
To: Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, apopple@nvidia.com,
akpm@linux-foundation.org, axelrasmussen@google.com,
byungchul@sk.com, cgroups@vger.kernel.org, david@kernel.org,
eperezma@redhat.com, gourry@gourry.net, jasowang@redhat.com,
hannes@cmpxchg.org, joshua.hahnjy@gmail.com,
Liam.Howlett@oracle.com, linux-kernel@vger.kernel.org,
lorenzo.stoakes@oracle.com, matthew.brost@intel.com,
mst@redhat.com, rppt@kernel.org, muchun.song@linux.dev,
zhengqi.arch@bytedance.com, rakie.kim@sk.com,
roman.gushchin@linux.dev, shakeel.butt@linux.dev,
surenb@google.com, virtualization@lists.linux.dev,
vbabka@suse.cz, weixugc@google.com, xuanzhuo@linux.alibaba.com,
ying.huang@linux.alibaba.com, yuanchu@google.com, ziy@nvidia.com,
kernel-team@meta.com
Subject: Re: [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy
Date: Mon, 16 Feb 2026 23:48:42 -0800
Message-ID: <9ae80317-f005-474c-9da1-95462138f3c6@gmail.com>
In-Reply-To: <aZOHIQj3pJ-9dW_0@tiehlicka>
On 2/16/26 1:07 PM, Michal Hocko wrote:
> On Mon 16-02-26 09:50:26, JP Kobryn (Meta) wrote:
>> On 2/16/26 12:26 AM, Michal Hocko wrote:
>>> On Thu 12-02-26 13:22:56, JP Kobryn wrote:
>>>> On 2/11/26 11:29 PM, Michal Hocko wrote:
>>>>> On Wed 11-02-26 20:51:08, JP Kobryn wrote:
>>>>>> It would be useful to see a breakdown of allocations to understand which
>>>>>> NUMA policies are driving them. For example, when investigating memory
>>>>>> pressure, having policy-specific counts could show that allocations were
>>>>>> bound to the affected node (via MPOL_BIND).
>>>>>>
>>>>>> Add per-policy page allocation counters as new node stat items. These
>>>>>> counters can provide correlation between a mempolicy and pressure on a
>>>>>> given node.
>>>>>
>>>>> Could you be more specific how exactly do you plan to use those
>>>>> counters?
>>>>
>>>> Yes. Patch 2 allows us to find which nodes are undergoing reclaim. Once
>>>> we identify the affected node(s), the new mpol counters (this patch)
>>>> allow us to correlate the pressure with the mempolicy driving it.
>>>
>>> I would appreciate somewhat more specificity. You are adding counters
>>> that are not really easy to drop once they are in. Sure, we have
>>> precedent for dropping some counters in the past, so this is not as hard
>>> as the usual userspace APIs, but still...
>>>
>>> How exactly do you attribute mempolicy allocations to specific nodes?
>>> While MPOL_BIND is quite straightforward, others are less so.
>>
>> The design does account for this regardless of the policy. In the call
>> to __mod_node_page_state(), I'm using page_pgdat(page) so the stat is
>> attributed to the node where the page actually landed.
>
> That much is clear[*]. The consumer side of things is not really clear to
> me. How do you know which policy or part of the nodemask of that policy
> is the source of the memory pressure on a particular node? In other
> words, how useful is the data really, beyond a single-node
> mempolicy (i.e. MPOL_BIND)?
Beyond the bind policy, the interleave (and weighted interleave) stats
would let us see the effective distribution of the policy. Pressure could
then be linked to a user-configured weight scheme, and the counters could
also help confirm that the observed distribution matches the expected one.
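As a rough illustration of confirming the expected distribution (nothing
from the patch, just a userspace sketch): read the per-node weights,
assuming the weighted_interleave sysfs interface, and print each node's
expected share to compare against the per-node counters from this patch.
The node count is an assumption for the example.

#include <stdio.h>

#define NR_NODES 4      /* assumption for the example */

int main(void)
{
        unsigned int w[NR_NODES] = {0}, total = 0;
        char path[128];
        int nid;

        for (nid = 0; nid < NR_NODES; nid++) {
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/kernel/mm/mempolicy/weighted_interleave/node%d",
                         nid);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fscanf(f, "%u", &w[nid]) == 1)
                        total += w[nid];
                fclose(f);
        }
        for (nid = 0; nid < NR_NODES; nid++)
                printf("node%d: expected %.1f%% of weighted-interleave allocs\n",
                       nid, total ? 100.0 * w[nid] / total : 0.0);
        return 0;
}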
You brought up the nodemask, and I think the preferred policy is a good
case for the counters as well. Once we know which node(s) are under
pressure and then see significant preferred allocations accounted there,
we could search numa_maps for "prefer:<node>" to find the tasks targeting
the affected nodes.
I mentioned this on another thread in this series, but I'll include it
here as well and expand on it a bit. For any given policy, the workflow
would be:
1) Pressure/OOMs reported while system-wide memory is free.
2) Check per-node pgscan/pgsteal stats (provided by patch 2) to narrow
down node(s) under pressure. They become available in
/sys/devices/system/node/nodeN/vmstat.
3) Check the per-policy allocation counters (this patch) on that node to
   find which policy is driving the pressure. Same readout at nodeN/vmstat
   (see the sketch after this list).
4) Now use /proc/*/numa_maps to identify tasks using the policy.
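For steps 2) and 3), the readout could be as simple as the sketch below.
The "mpol_" prefix is only a placeholder since the final counter names
aren't settled, and the pgscan/pgsteal lines assume patch 2 is applied.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
        int nid = argc > 1 ? atoi(argv[1]) : 0;
        char path[128], line[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/vmstat", nid);
        f = fopen(path, "r");
        if (!f) {
                perror(path);
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* reclaim activity (patch 2) and per-policy allocs (patch 1) */
                if (!strncmp(line, "pgscan", 6) ||
                    !strncmp(line, "pgsteal", 7) ||
                    !strncmp(line, "mpol_", 5))  /* hypothetical prefix */
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}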
>
> [*] btw. I believe you misaccount MPOL_LOCAL because you attribute the
> target node even when the allocation is from a remote node from the
> "local" POV.
It's a good point. The accounting that results from fallback cases
shouldn't detract from an investigation, though. We're interested in the
node(s) under pressure, and the relatively few fallback allocations would
land on nodes that are not under pressure, so they can be treated as
acceptable noise.
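To make that concrete, this is roughly how I think of the accounting site
(not the actual patch; the NR_MPOL_* item names are made up here). The
stat is bumped against the pgdat of the page that was actually allocated,
so a fallback under MPOL_LOCAL is charged to the remote node that
satisfied it rather than the node that was local at allocation time:

#include <linux/mempolicy.h>
#include <linux/mm.h>
#include <linux/vmstat.h>

static void mpol_account_alloc(struct page *page, unsigned short mode)
{
        enum node_stat_item item;

        switch (mode) {
        case MPOL_BIND:
                item = NR_MPOL_BIND_ALLOC;      /* hypothetical item */
                break;
        case MPOL_LOCAL:
                item = NR_MPOL_LOCAL_ALLOC;     /* hypothetical item */
                break;
        default:
                return;
        }
        /* attribute to the node that actually backed the page */
        __mod_node_page_state(page_pgdat(page), item, 1);
}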
Thread overview: 27+ messages
2026-02-12 4:51 [PATCH 0/2] improve per-node allocation and reclaim visibility JP Kobryn
2026-02-12 4:51 ` [PATCH 1/2] mm/mempolicy: track page allocations per mempolicy JP Kobryn
2026-02-12 7:29 ` Michal Hocko
2026-02-12 21:22 ` JP Kobryn
2026-02-16 8:26 ` Michal Hocko
2026-02-16 17:50 ` JP Kobryn (Meta)
2026-02-16 21:07 ` Michal Hocko
2026-02-17 7:48 ` JP Kobryn (Meta) [this message]
2026-02-17 12:37 ` Michal Hocko
2026-02-17 18:19 ` JP Kobryn (Meta)
2026-02-17 18:52 ` Michal Hocko
2026-02-12 15:07 ` Shakeel Butt
2026-02-12 21:23 ` JP Kobryn
2026-02-12 15:24 ` Vlastimil Babka
2026-02-12 21:25 ` JP Kobryn
2026-02-13 8:54 ` Vlastimil Babka
2026-02-13 19:56 ` JP Kobryn (Meta)
2026-02-18 4:25 ` kernel test robot
2026-02-12 4:51 ` [PATCH 2/2] mm: move pgscan and pgsteal to node stats JP Kobryn
2026-02-12 7:08 ` Michael S. Tsirkin
2026-02-12 21:23 ` JP Kobryn
2026-02-12 7:29 ` Michal Hocko
2026-02-12 21:20 ` JP Kobryn
2026-02-12 4:57 ` [PATCH 0/2] improve per-node allocation and reclaim visibility Matthew Wilcox
2026-02-12 21:22 ` JP Kobryn
2026-02-12 21:53 ` Matthew Wilcox
2026-02-12 18:08 ` [syzbot ci] " syzbot ci