On 2025/9/4 22:43, Joshua Hahn wrote:
On Thu, 4 Sep 2025 16:36:28 +0200 Michal Hocko <mhocko@suse.com> wrote:

On Thu 04-09-25 07:26:25, Joshua Hahn wrote:
On Thu, 4 Sep 2025 21:44:31 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:

Hello Jinjiang,

I hope you are doing well, thank you for this patchset!

out_of_memory() selects tasks without considering mempolicy. Assuming a
cpu-less NUMA node, ordinary processes that don't set a mempolicy won't
allocate memory from this cpu-less node unless the other NUMA nodes are
below the low watermark. If a task binds to this cpu-less node and
triggers an OOM, many tasks that don't occupy any memory on this node
may be wrongly killed.
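
For concreteness, here is a minimal userspace sketch (not from the
patchset) of how a task can end up bound this way; the node number is
made up, and the program links against libnuma (-lnuma) for the
set_mempolicy(2) wrapper:

    #include <numaif.h>     /* set_mempolicy(), MPOL_BIND */
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* Bind all future allocations of this task to node 2
         * (hypothetically the cpu-less node on this system). */
        unsigned long nodemask = 1UL << 2;

        if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
            return 1;

        /* Fault in memory until node 2 is exhausted; the resulting
         * OOM is mempolicy-constrained, i.e. out_of_memory() runs
         * with oc->nodemask set to this task's binding. */
        for (;;) {
            char *p = malloc(1 << 20);
            if (!p)
                return 1;
            memset(p, 1, 1 << 20);
        }
    }
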
I am wondering whether you have seen this happen in practice, or if this
is just based on inspecting the code. I have a feeling that the case you
are concerned about may already be covered in select_bad_process.

out_of_memory(oc)
    select_bad_process(oc)
        oom_evaluate_task(p, oc)
            oom_cpuset_eligible(task, oc)

                [...snip...]

                for_each_thread(start, tsk) {
                    if (mask) {
                        ret = mempolicy_in_oom_domain(tsk, mask);
                    } else {
                        ret = cpuset_mems_allowed_intersects(current, tsk);
                    }
                }

While iterating through the list of candidate processes, we check
whether oc->nodemask exists, and if not, we check whether the cpuset
nodemasks intersect. It seems like these are the two checks that you
add in the helper function.

With that said, I might be missing something obvious -- please feel
free to correct me if I am misunderstanding your patch or if I'm
missing something in the existing OOM target selection :-)
The thing with mempolicy_in_oom_domain is that it doesn't really do
what you might be thinking it is doing ;) as it will return true also
for tasks without any NUMA affinity, because those intersect with the
given mask by definition, since they can allocate from any node. So
they are eligible, and that is what Jinjiang Tu is concerned about, I
believe.
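
For reference, mempolicy_in_oom_domain() in mm/mempolicy.c looks
roughly like this (paraphrased from a recent kernel, so details may
differ by version); note that it defaults to true unless the task has
an MPOL_BIND policy:

    bool mempolicy_in_oom_domain(struct task_struct *tsk,
                                 const nodemask_t *mask)
    {
        struct mempolicy *mempolicy;
        bool ret = true;        /* eligible by default */

        if (!mask)
            return ret;

        task_lock(tsk);
        mempolicy = tsk->mempolicy;
        /* Only MPOL_BIND can exclude the task: with no mempolicy
         * (or a non-binding one) it may allocate from any node, so
         * it always "intersects" the oom nodemask. */
        if (mempolicy && mempolicy->mode == MPOL_BIND)
            ret = nodes_intersects(mempolicy->nodes, *mask);
        task_unlock(tsk);

        return ret;
    }
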
Hello Michal! Thank you for your insights :-)

Looking back, I made the mistake of thinking that we cared about the
!oc->nodemask case, whereas Jinjiang's patch cares about the case where
oc->nodemask is set. So I was checking that cpuset_mems_allowed_intersects
behaves the same as nodes_intersects, whereas I should have been checking
whether mempolicy_in_oom_domain is correct.
Most tasks don't mbind to specific nodes. In our use case, as described
in the reply to Michal, ordinary tasks are unlikely to allocate from
these cpu-less NUMA nodes.

Looking into it, everything you said is correct and I think I
definitely overlooked what the patch was trying to do. Thank you for
clarifying these points for me!

I hope you have a great day,
Joshua

-- 
Michal Hocko
SUSE Labs