On 2025/9/8 17:11, Michal Hocko wrote:
On Mon 08-09-25 16:16:38, Jinjiang Tu wrote:
On 2025/9/8 15:46, Michal Hocko wrote:
On Sat 06-09-25 09:56:16, Jinjiang Tu wrote:
In our use case, movable nodes are included in all cpusets so that all
tasks can allocate from them. Even when we move tasks into cpusets that
only allow allocation from movable nodes,
oom_cpuset_eligible()->cpuset_mems_allowed_intersects() still returns
true for all tasks.
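For reference, the selection path we are hitting looks roughly like this
(a simplified sketch of oom_cpuset_eligible() in mm/oom_kill.c, details
trimmed):

static bool oom_cpuset_eligible(struct task_struct *start,
                                struct oom_control *oc)
{
        struct task_struct *tsk;
        bool ret = false;
        const nodemask_t *mask = oc->nodemask;

        rcu_read_lock();
        for_each_thread(start, tsk) {
                if (mask)
                        /* mempolicy-constrained OOM: check tsk's mempolicy */
                        ret = mempolicy_in_oom_domain(tsk, mask);
                else
                        /* cpuset-constrained OOM: check tsk's cpuset mems */
                        ret = cpuset_mems_allowed_intersects(current, tsk);
                if (ret)
                        break;
        }
        rcu_read_unlock();

        return ret;
}

Since the movable nodes are in every cpuset's mems_allowed, the
cpuset_mems_allowed_intersects() branch is true for every task.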
Right, but this is because you allowed _all_ tasks to allocate from those
movable nodes, so why would that be unexpected behavior?

Maybe when oc->nodemask contains only movable nodes, we should select only
tasks whose mempolicy intersects with oc->nodemask, like the following:

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eb83cff7db8c..e56b6de836a6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
         if (!mask)
                 return ret;
+       if (!nodes_intersects(*mask, node_states[N_CPU]))
+               ret = false;
+
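With this, when the constraint nodemask contains no CPU-bearing nodes, ret
starts out false and only the existing MPOL_BIND intersection check below
can set it back to true, so tasks without a binding mempolicy on those
nodes would no longer be considered eligible.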
Nope, this doesn't really make much sense TBH. I believe you should stop
special-casing cpuless nodes, look into the actual configuration, and see
how to make the OOM task selection cpuset based. Your underlying problem
is not that a NUMA node has no CPUs assigned, but an allocation constraint
based on the movability of allocations, so you need to find a solution
that deals with that constraint.
Many tasks are in the root cpuset, systemd for example. The root cpuset
contains all nodes; we can't exclude cpuless nodes from it.

If we rely on cpuset-based OOM task selection, tasks in the root cpuset
may still be selected.
If you start by killing tasks from the cpuset of the currently allocating
task, then this shouldn't really happen, right?
Do you mean we should put the tasks into the same cpuset, then limit the
max usage of the memcg so that only memcg OOM is triggered and tasks are
selected from the same memcg?