> On Mon 08-09-25 16:16:38, Jinjiang Tu wrote:
> > On 2025/9/8 15:46, Michal Hocko wrote:
> > > On Sat 06-09-25 09:56:16, Jinjiang Tu wrote:
> > > > In our use case, movable nodes are in all cpusets, so that movable
> > > > nodes can be used by all tasks. Even if we move tasks into cpusets
> > > > that only allow allocating from movable nodes,
> > > > oom_cpuset_eligible()->cpuset_mems_allowed_intersects() returns
> > > > true for all tasks.
> > >
> > > Right, but this is because you allowed _all_ tasks to allocate from
> > > those movable nodes, so why would that be an unexpected behavior?
> > >
> > > > Maybe when oc->nodemask == movable nodes, we should only select
> > > > tasks whose mempolicy intersects with oc->nodemask. Like the
> > > > following:
> > > >
> > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > index eb83cff7db8c..e56b6de836a6 100644
> > > > --- a/mm/mempolicy.c
> > > > +++ b/mm/mempolicy.c
> > > > @@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
> > > >  	if (!mask)
> > > >  		return ret;
> > > >  
> > > > +	if (!nodes_intersects(*mask, node_states[N_CPU]))
> > > > +		ret = false;
> > > > +
> > >
> > > Nope, this doesn't really make much sense TBH. I believe you should
> > > stop special casing cpuless nodes, look into the actual configuration
> > > and check how to do cpuset based OOM task selection. Your underlying
> > > problem is not about no CPUs assigned to a NUMA node but an
> > > allocation constraint based on the movability of allocations, so you
> > > need to find a solution that deals with that constraint.
> >
> > Many tasks are in the root cpuset, systemd for example. The root
> > cpuset contains all nodes, so we can't exclude the cpu-less nodes. If
> > we rely on cpuset based OOM task selection, tasks in the root cpuset
> > may still be selected.
>
> If you start by killing tasks from the cpuset of the currently
> allocating task then this shouldn't really happen, right?
Do you mean we should put the tasks into the same cpuset, and then limit
the max usage of their memcg, so that only a memcg OOM is triggered and
the victim is selected from that memcg?
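
For concreteness, below is a minimal userspace sketch of that kind of
setup (my own illustration, not code from this thread): a cgroup v2
child group whose cpuset.mems is restricted to the movable nodes and
whose memory.max is capped, so that a task moved into it hits a memcg
OOM and the victim is picked from that group only. The cgroup path
/sys/fs/cgroup/movable-only, the node list "2-3" and the "8G" limit are
assumptions; it must run as root, and the parent's cgroup.subtree_control
needs the cpuset and memory controllers enabled.

/*
 * Sketch only: confine tasks to movable nodes and cap their memory so
 * that overflow is handled as a memcg OOM inside this group.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		if (fd >= 0)
			close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(int argc, char **argv)
{
	const char *cg = "/sys/fs/cgroup/movable-only";	/* assumed path */
	char path[256];

	if (argc < 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	mkdir(cg, 0755);

	/* Assumption: nodes 2-3 are the cpu-less movable nodes here. */
	snprintf(path, sizeof(path), "%s/cpuset.mems", cg);
	if (write_file(path, "2-3"))
		return 1;

	/* Cap usage so overflow triggers a memcg OOM in this group only. */
	snprintf(path, sizeof(path), "%s/memory.max", cg);
	if (write_file(path, "8G"))
		return 1;

	/* Move the target task into the group. */
	snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
	return write_file(path, argv[1]) ? 1 : 0;
}

If that is what you have in mind, the open question for us is whether it
is acceptable to require such a memcg limit just to get sane victim
selection for the movable-nodes-only constraint.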