* [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
@ 2025-09-04 13:44 Jinjiang Tu
2025-09-04 14:25 ` Michal Hocko
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-04 13:44 UTC (permalink / raw)
To: mhocko, rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm
Cc: wangkefeng.wang, tujinjiang
out_of_memory() selects tasks without considering mempolicy. Assume a
cpu-less NUMA Node: ordinary processes that don't set a mempolicy don't
allocate memory from this cpu-less Node unless the other NUMA Nodes are below
the low watermark. If a task binds to this cpu-less Node and triggers OOM, many
tasks that don't occupy any memory from this Node may be wrongly killed.
To fix it, only kill current if all nodes in oc->nodemask lack CPUs.
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
mm/oom_kill.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..8ae4b2ecfe12 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1100,6 +1100,20 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+static bool should_oom_kill_allocating_task(struct oom_control *oc)
+{
+        if (sysctl_oom_kill_allocating_task)
+                return true;
+
+        if (!oc->nodemask)
+                return false;
+
+        if (nodes_intersects(*oc->nodemask, node_states[N_CPU]))
+                return false;
+
+        return true;
+}
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -1151,7 +1165,7 @@ bool out_of_memory(struct oom_control *oc)
                 oc->nodemask = NULL;
         check_panic_on_oom(oc);
 
-        if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
+        if (!is_memcg_oom(oc) && should_oom_kill_allocating_task(oc) &&
             current->mm && !oom_unkillable_task(current) &&
             oom_cpuset_eligible(current, oc) &&
             current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
--
2.43.0
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-04 13:44 [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes Jinjiang Tu
@ 2025-09-04 14:25 ` Michal Hocko
2025-09-05 1:56 ` Jinjiang Tu
2025-09-05 9:13 ` Michal Hocko
2025-09-04 14:26 ` Joshua Hahn
2025-09-08 17:50 ` Gregory Price
2 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2025-09-04 14:25 UTC (permalink / raw)
To: Jinjiang Tu
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
> out_of_memory() selects tasks without considering mempolicy. Assuming a
> cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> tasks may be killed wrongly that don't occupy memory from this Node.
I can see how a misconfigured task that binds _only_ to memoryless nodes
should be killed, but this is not what the patch does, right? Could you
tell us more about the specific situation?
> To fix it, only kill current if oc->nodemask are all nodes without any cpu.
>
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
> mm/oom_kill.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..8ae4b2ecfe12 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1100,6 +1100,20 @@ int unregister_oom_notifier(struct notifier_block *nb)
> }
> EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>
> +static bool should_oom_kill_allocating_task(struct oom_control *oc)
> +{
> + if (sysctl_oom_kill_allocating_task)
> + return true;
> +
> + if (!oc->nodemask)
> + return false;
> +
> + if (nodes_intersects(*oc->nodemask, node_states[N_CPU]))
> + return false;
> +
> + return true;
> +}
> +
> /**
> * out_of_memory - kill the "best" process when we run out of memory
> * @oc: pointer to struct oom_control
> @@ -1151,7 +1165,7 @@ bool out_of_memory(struct oom_control *oc)
> oc->nodemask = NULL;
> check_panic_on_oom(oc);
>
> - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
> + if (!is_memcg_oom(oc) && should_oom_kill_allocating_task(oc) &&
> current->mm && !oom_unkillable_task(current) &&
> oom_cpuset_eligible(current, oc) &&
> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> --
> 2.43.0
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-04 13:44 [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes Jinjiang Tu
2025-09-04 14:25 ` Michal Hocko
@ 2025-09-04 14:26 ` Joshua Hahn
2025-09-04 14:36 ` Michal Hocko
2025-09-08 17:50 ` Gregory Price
2 siblings, 1 reply; 21+ messages in thread
From: Joshua Hahn @ 2025-09-04 14:26 UTC (permalink / raw)
To: Jinjiang Tu
Cc: mhocko, rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Thu, 4 Sep 2025 21:44:31 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:
Hello Jinjiang,
I hope you are doing well, thank you for this patchset!
> out_of_memory() selects tasks without considering mempolicy. Assuming a
> cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> tasks may be killed wrongly that don't occupy memory from this Node.
I am wondering whether you have seen this happen in practice, or if this is
just based on inspecting the code. I have a feeling that the case you are
concerned about may already be covered in select_bad_process.
out_of_memory(oc)
  select_bad_process(oc)
    oom_evaluate_task(p, oc)
      oom_cpuset_eligible(task, oc)

[...snip...]

        for_each_thread(start, tsk) {
                if (mask) {
                        ret = mempolicy_in_oom_domain(tsk, mask);
                } else {
                        ret = cpuset_mems_allowed_intersects(current, tsk);
                }
        }
While iterating through the list of candidate processes, we check whether
oc->nodemask exists, and if not, we check whether the nodemasks intersect. It seems
like these are the two checks that you add in the helper function.
With that said, I might be missing something obvious -- please feel free to
correct me if I am misunderstanding your patch or if I'm missing something
in the existing oom target selection : -)
I do see that with your patch, we avoid having to go through select_bad_process
and go straight to choosing the current task, which I can definitely
see as an argument. But in that case I think this patch's description would be
more of an optimization, and less of a fix, since the behavior is already
accounted for.
Again, please feel free to correct me : -) I hope you have a great day!
Joshua
Sent using hkml (https://github.com/sjp38/hackermail)
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-04 14:26 ` Joshua Hahn
@ 2025-09-04 14:36 ` Michal Hocko
2025-09-04 14:43 ` Joshua Hahn
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-09-04 14:36 UTC (permalink / raw)
To: Joshua Hahn
Cc: Jinjiang Tu, rientjes, shakeel.butt, akpm, david, ziy,
matthew.brost, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Thu 04-09-25 07:26:25, Joshua Hahn wrote:
> On Thu, 4 Sep 2025 21:44:31 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:
>
> Hello Jinjiang,
>
> I hope you are doing well, thank you for this patchset!
>
> > out_of_memory() selects tasks without considering mempolicy. Assuming a
> > cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> > allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> > low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> > tasks may be killed wrongly that don't occupy memory from this Node.
>
> I am wondeirng whether you have seen this happen in practice, or if this is
> just based on inspecting the code. I have a feeling that the case you are
> concerned about may already be covered in select_bad_process.
>
> out_of_memory(oc)
> select_bad_process(oc)
> oom_evaluate_task(p, oc)
> oom_cpuset_eligible(task, oc)
>
> [...snip...]
>
> for_each_thread(start, tsk) {
> if (mask) {
> ret = mempolicy_in_oom_domain(tsk, mask);
> } else {
> ret = cpuset_mems_allowed_intersects(current, tsk)
> }
> }
>
> While iterating through the list of candidate processes, we check whether
> oc->nodemask exists, and if not, we check if the nodemasks intersects. It seems
> like these are the two checks that you add in the helper function.
>
> With that said, I might be missing something obvious -- please feel to
> correct me if I am misunderstanding your patch or if I'm missing something
> in the existing oom target selection : -)
The thing with mempolicy_in_oom_domain is that it doesn't really do what
you might be thinking it is doing ;) as it will return true also for tasks
without any NUMA affinity, because those intersect with the given mask by
definition as they can allocate from any node. So they are eligible, and
that is what Jinjiang Tu is concerned about, I believe.
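For reference, this is roughly what mempolicy_in_oom_domain() does today
(abbreviated sketch of mm/mempolicy.c): ret defaults to true, so only an
explicit MPOL_BIND policy can take a task out of the OOM domain.

bool mempolicy_in_oom_domain(struct task_struct *tsk,
                             const nodemask_t *mask)
{
        struct mempolicy *mempolicy;
        bool ret = true;

        /* No constrained OOM: every task is in the domain. */
        if (!mask)
                return ret;

        /* Only an explicit MPOL_BIND policy can narrow the answer;
         * tasks without a mempolicy keep the default of true. */
        task_lock(tsk);
        mempolicy = tsk->mempolicy;
        if (mempolicy && mempolicy->mode == MPOL_BIND)
                ret = nodes_intersects(mempolicy->nodes, *mask);
        task_unlock(tsk);

        return ret;
}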
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-04 14:36 ` Michal Hocko
@ 2025-09-04 14:43 ` Joshua Hahn
2025-09-05 2:05 ` Jinjiang Tu
0 siblings, 1 reply; 21+ messages in thread
From: Joshua Hahn @ 2025-09-04 14:43 UTC (permalink / raw)
To: Michal Hocko
Cc: Jinjiang Tu, rientjes, shakeel.butt, akpm, david, ziy,
matthew.brost, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Thu, 4 Sep 2025 16:36:28 +0200 Michal Hocko <mhocko@suse.com> wrote:
> On Thu 04-09-25 07:26:25, Joshua Hahn wrote:
> > On Thu, 4 Sep 2025 21:44:31 +0800 Jinjiang Tu <tujinjiang@huawei.com> wrote:
> >
> > Hello Jinjiang,
> >
> > I hope you are doing well, thank you for this patchset!
> >
> > > out_of_memory() selects tasks without considering mempolicy. Assuming a
> > > cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> > > allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> > > low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> > > tasks may be killed wrongly that don't occupy memory from this Node.
> >
> > I am wondeirng whether you have seen this happen in practice, or if this is
> > just based on inspecting the code. I have a feeling that the case you are
> > concerned about may already be covered in select_bad_process.
> >
> > out_of_memory(oc)
> > select_bad_process(oc)
> > oom_evaluate_task(p, oc)
> > oom_cpuset_eligible(task, oc)
> >
> > [...snip...]
> >
> > for_each_thread(start, tsk) {
> > if (mask) {
> > ret = mempolicy_in_oom_domain(tsk, mask);
> > } else {
> > ret = cpuset_mems_allowed_intersects(current, tsk)
> > }
> > }
> >
> > While iterating through the list of candidate processes, we check whether
> > oc->nodemask exists, and if not, we check if the nodemasks intersects. It seems
> > like these are the two checks that you add in the helper function.
> >
> > With that said, I might be missing something obvious -- please feel to
> > correct me if I am misunderstanding your patch or if I'm missing something
> > in the existing oom target selection : -)
>
> The thing with mempolicy_in_oom_domain is that it doesn't really do what
> you might be thinking it is doing ;) as it will true also for tasks
> without any NUMA affinity because those intersect with the given mask by
> definition as they can allocate from any node. So they are eligible and
> that is what Jinjiang Tu is considered about I believe.
Hello Michal! Thank you for your insights : -)
Looking back, I made the mistake of thinking that we cared about the
!oc->nodemask case, whereas Jinjiang's patch cares about the case where oc->nodemask is set.
So I was checking that cpuset_mems_allowed_intersects was the same as
nodes_intersects, whereas I should have been checking whether mempolicy_in_oom_domain
is correct.
Looking into it, everything you said is correct and I think I definitely
overlooked what the patch was trying to do. Thank you for clarifying these
points for me!
I hope you have a great day,
Joshua
> --
> Michal Hocko
> SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-04 14:25 ` Michal Hocko
@ 2025-09-05 1:56 ` Jinjiang Tu
2025-09-05 8:08 ` Michal Hocko
2025-09-05 9:13 ` Michal Hocko
1 sibling, 1 reply; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-05 1:56 UTC (permalink / raw)
To: Michal Hocko
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On 2025/9/4 22:25, Michal Hocko wrote:
> On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
>> out_of_memory() selects tasks without considering mempolicy. Assuming a
>> cpu-less NUMA Node, ordinary process that don't set mempolicy don't
>> allocate memory from this cpu-less Node, unless other NUMA Nodes are below
>> low watermark. If a task binds to this cpu-less Node and triggers OOM, many
>> tasks may be killed wrongly that don't occupy memory from this Node.
> I can see how a miconfigured task that binds _only_ to memoryless nodes
> should be killed but this is not what the patch does, right? Could you
> tell us more about the specific situation?
We have some cpu-less NUMA Nodes whose memory is hotplugged, and their zone
is configured as ZONE_MOVABLE to guarantee the in-use memory can be migrated when
we want to offline the NUMA Node.
Generally, tasks don't configure any mempolicy and use the default policy, i.e.
allocate from the NUMA Node the task is running on, and fall back to other NUMA Nodes
when the local NUMA Node is below the low watermark. As a result, these cpu-less NUMA Nodes
won't be allocated from until the NUMA Nodes with cpus are low on memory. However, these
cpu-less NUMA Nodes are configured as ZONE_MOVABLE and can't be used for kernel allocations,
leading to OOM while a large amount of MOVABLE memory is still free.
To avoid that, we bind some tasks to these cpu-less NUMA Nodes to make use of this memory.
When these tasks trigger OOM, tasks that don't use these cpu-less NUMA Nodes may be killed
according to rss. Even worse, after one task is killed, the allocating task finds there is
still no memory, triggers OOM again and kills another wrong task.
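For illustration, a minimal userspace sketch of this kind of binding (node 2 is
a hypothetical stand-in for one of the cpu-less movable nodes; set_mempolicy()
comes from libnuma's <numaif.h>):

#include <numaif.h>     /* set_mempolicy(), MPOL_BIND; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
        /* Node 2 stands in for one of the cpu-less, ZONE_MOVABLE-only nodes;
         * the real node id depends on the platform. */
        unsigned long nodemask = 1UL << 2;
        size_t sz = 64UL << 20;         /* 64 MiB of anonymous memory */
        char *buf;

        if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask))) {
                perror("set_mempolicy");
                return EXIT_FAILURE;
        }

        buf = malloc(sz);
        if (!buf)
                return EXIT_FAILURE;
        memset(buf, 0, sz);     /* faulting the pages places them on node 2 only */

        free(buf);
        return EXIT_SUCCESS;
}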
>> To fix it, only kill current if oc->nodemask are all nodes without any cpu.
>>
>> Signed-off-by: Jinjiang Tu<tujinjiang@huawei.com>
>> ---
>> mm/oom_kill.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 25923cfec9c6..8ae4b2ecfe12 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1100,6 +1100,20 @@ int unregister_oom_notifier(struct notifier_block *nb)
>> }
>> EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>>
>> +static bool should_oom_kill_allocating_task(struct oom_control *oc)
>> +{
>> + if (sysctl_oom_kill_allocating_task)
>> + return true;
>> +
>> + if (!oc->nodemask)
>> + return false;
>> +
>> + if (nodes_intersects(*oc->nodemask, node_states[N_CPU]))
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> /**
>> * out_of_memory - kill the "best" process when we run out of memory
>> * @oc: pointer to struct oom_control
>> @@ -1151,7 +1165,7 @@ bool out_of_memory(struct oom_control *oc)
>> oc->nodemask = NULL;
>> check_panic_on_oom(oc);
>>
>> - if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
>> + if (!is_memcg_oom(oc) && should_oom_kill_allocating_task(oc) &&
>> current->mm && !oom_unkillable_task(current) &&
>> oom_cpuset_eligible(current, oc) &&
>> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>> --
>> 2.43.0
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-04 14:43 ` Joshua Hahn
@ 2025-09-05 2:05 ` Jinjiang Tu
0 siblings, 0 replies; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-05 2:05 UTC (permalink / raw)
To: Joshua Hahn, Michal Hocko
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
rakie.kim, byungchul, gourry, ying.huang, apopple, linux-mm,
wangkefeng.wang
On 2025/9/4 22:43, Joshua Hahn wrote:
> On Thu, 4 Sep 2025 16:36:28 +0200 Michal Hocko<mhocko@suse.com> wrote:
>
>> On Thu 04-09-25 07:26:25, Joshua Hahn wrote:
>>> On Thu, 4 Sep 2025 21:44:31 +0800 Jinjiang Tu<tujinjiang@huawei.com> wrote:
>>>
>>> Hello Jinjiang,
>>>
>>> I hope you are doing well, thank you for this patchset!
>>>
>>>> out_of_memory() selects tasks without considering mempolicy. Assuming a
>>>> cpu-less NUMA Node, ordinary process that don't set mempolicy don't
>>>> allocate memory from this cpu-less Node, unless other NUMA Nodes are below
>>>> low watermark. If a task binds to this cpu-less Node and triggers OOM, many
>>>> tasks may be killed wrongly that don't occupy memory from this Node.
>>> I am wondeirng whether you have seen this happen in practice, or if this is
>>> just based on inspecting the code. I have a feeling that the case you are
>>> concerned about may already be covered in select_bad_process.
>>>
>>> out_of_memory(oc)
>>> select_bad_process(oc)
>>> oom_evaluate_task(p, oc)
>>> oom_cpuset_eligible(task, oc)
>>>
>>> [...snip...]
>>>
>>> for_each_thread(start, tsk) {
>>> if (mask) {
>>> ret = mempolicy_in_oom_domain(tsk, mask);
>>> } else {
>>> ret = cpuset_mems_allowed_intersects(current, tsk)
>>> }
>>> }
>>>
>>> While iterating through the list of candidate processes, we check whether
>>> oc->nodemask exists, and if not, we check if the nodemasks intersects. It seems
>>> like these are the two checks that you add in the helper function.
>>>
>>> With that said, I might be missing something obvious -- please feel to
>>> correct me if I am misunderstanding your patch or if I'm missing something
>>> in the existing oom target selection : -)
>> The thing with mempolicy_in_oom_domain is that it doesn't really do what
>> you might be thinking it is doing ;) as it will true also for tasks
>> without any NUMA affinity because those intersect with the given mask by
>> definition as they can allocate from any node. So they are eligible and
>> that is what Jinjiang Tu is considered about I believe.
> Hello Michal! Thank you for your insights : -)
>
> Looking back, I made the mistake of thinking that we cared about the
> !oc->nodemask case, where Jinjiang's patch cares about the oc->nodemask == True
> case. So I was checking that cpuset_mems_allowed_intersects was the same as
> nodes_intersects, whereas I should have been checking if mempolicy_in_oom_domain
> is correct.
Most tasks don't mbind to specific nodes. In our use case, as described in the reply
to Michal, ordinary tasks are unlikely to allocate from these cpu-less NUMA Nodes.
>
> Looking into it, everything you said is correct and I think I defintely
> overlooked what the patch was trying to do. Thank you for clarifying these
> points for me!
>
> I hope you have a great day,
> Joshua
>
>> --
>> Michal Hocko
>> SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-05 1:56 ` Jinjiang Tu
@ 2025-09-05 8:08 ` Michal Hocko
2025-09-05 8:18 ` Jinjiang Tu
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-09-05 8:08 UTC (permalink / raw)
To: Jinjiang Tu
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Fri 05-09-25 09:56:03, Jinjiang Tu wrote:
>
> On 2025/9/4 22:25, Michal Hocko wrote:
> > On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
> > > out_of_memory() selects tasks without considering mempolicy. Assuming a
> > > cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> > > allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> > > low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> > > tasks may be killed wrongly that don't occupy memory from this Node.
> > I can see how a miconfigured task that binds _only_ to memoryless nodes
> > should be killed but this is not what the patch does, right? Could you
> > tell us more about the specific situation?
>
> We have some cpu-less NUMA Nodes, the memory are hotpluged in, and the zone
> is configured as ZONE_MOVABLE to guarantee these used memory can be migrated when
> we want to offline the NUMA Node.
>
> Generally tasks doesn't configure any mempolicy and use the default mempolicy, i.e.
> allocate from NUMA Node where the task is running on, and fallback to other NUMA Nodes
> when the local NUMA Node is below low watermark.As a result, these cpu-less NUMA Nodes
> won't be allocated until the NUMA Nodes with cpus are with low memory. However, These
> cpu-less NUMA Nodes are configured as ZONE_MOVABLE, can't be used by kernel allocation,
> leading to OOM with large amount of MOVABLE memory.
Right, this is a fundamental constraint of movable zones. They cannot
satisfy non-movable allocations, and you can get an OOM for those requests
even if there is plenty of movable memory available. This is no
different from highmem systems and kernel allocations.
> To avoid it, we make some tasks binds to these cpu-less NUMA Nodes to use these memory.
> When these tasks trigger OOM, tasks that don't use these cpu-less NUMA Nodes may be killed
> according to rss.Even worse, after one task is killed, the allocating task find there is
> still no memory, triggers OOM again and kills another wrong task.
Let's see whether I follow you here. So you are binding some tasks to movable
nodes only, and if their allocation fails you want to kill that task
rather than invoke the mempolicy OOM killer, as that could kill tasks
which are not constrained to movable nodes, right?
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-05 8:08 ` Michal Hocko
@ 2025-09-05 8:18 ` Jinjiang Tu
2025-09-05 9:10 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-05 8:18 UTC (permalink / raw)
To: Michal Hocko
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On 2025/9/5 16:08, Michal Hocko wrote:
> On Fri 05-09-25 09:56:03, Jinjiang Tu wrote:
>> On 2025/9/4 22:25, Michal Hocko wrote:
>>> On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
>>>> out_of_memory() selects tasks without considering mempolicy. Assuming a
>>>> cpu-less NUMA Node, ordinary process that don't set mempolicy don't
>>>> allocate memory from this cpu-less Node, unless other NUMA Nodes are below
>>>> low watermark. If a task binds to this cpu-less Node and triggers OOM, many
>>>> tasks may be killed wrongly that don't occupy memory from this Node.
>>> I can see how a miconfigured task that binds _only_ to memoryless nodes
>>> should be killed but this is not what the patch does, right? Could you
>>> tell us more about the specific situation?
>> We have some cpu-less NUMA Nodes, the memory are hotpluged in, and the zone
>> is configured as ZONE_MOVABLE to guarantee these used memory can be migrated when
>> we want to offline the NUMA Node.
>>
>> Generally tasks doesn't configure any mempolicy and use the default mempolicy, i.e.
>> allocate from NUMA Node where the task is running on, and fallback to other NUMA Nodes
>> when the local NUMA Node is below low watermark.As a result, these cpu-less NUMA Nodes
>> won't be allocated until the NUMA Nodes with cpus are with low memory. However, These
>> cpu-less NUMA Nodes are configured as ZONE_MOVABLE, can't be used by kernel allocation,
>> leading to OOM with large amount of MOVABLE memory.
> Right, this is a fundamental constrain of movable zones. They cannot
> satisfy non-movable allocations and you can get OOM for those requests
> even if there is plenty of movable memory available. This is no
> different from highmem systems and kernel allocations.
>
>> To avoid it, we make some tasks binds to these cpu-less NUMA Nodes to use these memory.
>> When these tasks trigger OOM, tasks that don't use these cpu-less NUMA Nodes may be killed
>> according to rss.Even worse, after one task is killed, the allocating task find there is
>> still no memory, triggers OOM again and kills another wrong task.
> Let's see whether I follow you here. So you are binding some tasks to movable
> nodes only and if their allocation fails you want to kill that task
> rather than invoking mempolicy OOM killer as that could kill tasks
> which are not constrained to movable nodes, right?
Yes. It's difficult to kill tasks that use the movable nodes' memory, because we have
no per-numa rss information for each task. So killing the current task is the simplest way
to avoid killing the wrong one.
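For context, userspace can approximate a task's per-node usage by summing the
N<node>=<pages> fields of /proc/<pid>/numa_maps, but nothing equivalent is
cheaply available to in-kernel victim selection. A rough sketch, with a made-up
helper name:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: sum the N<node>=<pages> counters in
 * /proc/<pid>/numa_maps for one node.  The result is in pages. */
static long pages_on_node(int pid, int node)
{
        char path[64], key[32], line[4096];
        long total = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/numa_maps", pid);
        snprintf(key, sizeof(key), "N%d=", node);

        f = fopen(path, "r");
        if (!f)
                return -1;

        while (fgets(line, sizeof(line), f)) {
                char *p = strstr(line, key);

                if (p)
                        total += strtol(p + strlen(key), NULL, 10);
        }
        fclose(f);

        return total;
}

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <pid> <node>\n", argv[0]);
                return EXIT_FAILURE;
        }

        printf("%ld pages\n", pages_on_node(atoi(argv[1]), atoi(argv[2])));
        return EXIT_SUCCESS;
}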
>
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-05 8:18 ` Jinjiang Tu
@ 2025-09-05 9:10 ` Michal Hocko
2025-09-05 9:25 ` Jinjiang Tu
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-09-05 9:10 UTC (permalink / raw)
To: Jinjiang Tu
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Fri 05-09-25 16:18:43, Jinjiang Tu wrote:
>
> On 2025/9/5 16:08, Michal Hocko wrote:
> > On Fri 05-09-25 09:56:03, Jinjiang Tu wrote:
> > > On 2025/9/4 22:25, Michal Hocko wrote:
> > > > On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
> > > > > out_of_memory() selects tasks without considering mempolicy. Assuming a
> > > > > cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> > > > > allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> > > > > low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> > > > > tasks may be killed wrongly that don't occupy memory from this Node.
> > > > I can see how a miconfigured task that binds _only_ to memoryless nodes
> > > > should be killed but this is not what the patch does, right? Could you
> > > > tell us more about the specific situation?
> > > We have some cpu-less NUMA Nodes, the memory are hotpluged in, and the zone
> > > is configured as ZONE_MOVABLE to guarantee these used memory can be migrated when
> > > we want to offline the NUMA Node.
> > >
> > > Generally tasks doesn't configure any mempolicy and use the default mempolicy, i.e.
> > > allocate from NUMA Node where the task is running on, and fallback to other NUMA Nodes
> > > when the local NUMA Node is below low watermark.As a result, these cpu-less NUMA Nodes
> > > won't be allocated until the NUMA Nodes with cpus are with low memory. However, These
> > > cpu-less NUMA Nodes are configured as ZONE_MOVABLE, can't be used by kernel allocation,
> > > leading to OOM with large amount of MOVABLE memory.
> > Right, this is a fundamental constrain of movable zones. They cannot
> > satisfy non-movable allocations and you can get OOM for those requests
> > even if there is plenty of movable memory available. This is no
> > different from highmem systems and kernel allocations.
> >
> > > To avoid it, we make some tasks binds to these cpu-less NUMA Nodes to use these memory.
> > > When these tasks trigger OOM, tasks that don't use these cpu-less NUMA Nodes may be killed
> > > according to rss.Even worse, after one task is killed, the allocating task find there is
> > > still no memory, triggers OOM again and kills another wrong task.
> > Let's see whether I follow you here. So you are binding some tasks to movable
> > nodes only and if their allocation fails you want to kill that task
> > rather than invoking mempolicy OOM killer as that could kill tasks
> > which are not constrained to movable nodes, right?
>
> Yes. It't difficult to kill tasks that use movable nodes memory, because we have
> no information of per-numa rss of each task. So, kill current task is the simplest way
> to avoid killing wrongly.
There were attempts to make the oom killer cpuset aware. This would
allow constraining the oom killer to the cpuset for which we cannot
satisfy the allocation. I do not remember the details of why this never
reached a mergeable state. Have you considered something like that as an option?
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-04 14:25 ` Michal Hocko
2025-09-05 1:56 ` Jinjiang Tu
@ 2025-09-05 9:13 ` Michal Hocko
1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2025-09-05 9:13 UTC (permalink / raw)
To: Jinjiang Tu
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Thu 04-09-25 16:25:52, Michal Hocko wrote:
> On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
> > out_of_memory() selects tasks without considering mempolicy. Assuming a
> > cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> > allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> > low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> > tasks may be killed wrongly that don't occupy memory from this Node.
>
> I can see how a miconfigured task that binds _only_ to memoryless nodes
> should be killed but this is not what the patch does, right? Could you
> tell us more about the specific situation?
Now I have realized that I misread the patch. You are indeed trying
to kill the allocating task only if the allocation nodemask _does_not_
intersect with any node that has CPUs. This is better than I originally
understood, but it is still a rather ad-hoc heuristic. It doesn't represent the
movability of cpuless nodes and also adds a heuristic that would be
really hard to get rid of later on. As mentioned in the other reply, I would
recommend looking into using cpusets for this purpose.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-05 9:10 ` Michal Hocko
@ 2025-09-05 9:25 ` Jinjiang Tu
2025-09-05 9:42 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-05 9:25 UTC (permalink / raw)
To: Michal Hocko
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On 2025/9/5 17:10, Michal Hocko wrote:
> On Fri 05-09-25 16:18:43, Jinjiang Tu wrote:
>> On 2025/9/5 16:08, Michal Hocko wrote:
>>> On Fri 05-09-25 09:56:03, Jinjiang Tu wrote:
>>>> On 2025/9/4 22:25, Michal Hocko wrote:
>>>>> On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
>>>>>> out_of_memory() selects tasks without considering mempolicy. Assuming a
>>>>>> cpu-less NUMA Node, ordinary process that don't set mempolicy don't
>>>>>> allocate memory from this cpu-less Node, unless other NUMA Nodes are below
>>>>>> low watermark. If a task binds to this cpu-less Node and triggers OOM, many
>>>>>> tasks may be killed wrongly that don't occupy memory from this Node.
>>>>> I can see how a miconfigured task that binds _only_ to memoryless nodes
>>>>> should be killed but this is not what the patch does, right? Could you
>>>>> tell us more about the specific situation?
>>>> We have some cpu-less NUMA Nodes, the memory are hotpluged in, and the zone
>>>> is configured as ZONE_MOVABLE to guarantee these used memory can be migrated when
>>>> we want to offline the NUMA Node.
>>>>
>>>> Generally tasks doesn't configure any mempolicy and use the default mempolicy, i.e.
>>>> allocate from NUMA Node where the task is running on, and fallback to other NUMA Nodes
>>>> when the local NUMA Node is below low watermark.As a result, these cpu-less NUMA Nodes
>>>> won't be allocated until the NUMA Nodes with cpus are with low memory. However, These
>>>> cpu-less NUMA Nodes are configured as ZONE_MOVABLE, can't be used by kernel allocation,
>>>> leading to OOM with large amount of MOVABLE memory.
>>> Right, this is a fundamental constrain of movable zones. They cannot
>>> satisfy non-movable allocations and you can get OOM for those requests
>>> even if there is plenty of movable memory available. This is no
>>> different from highmem systems and kernel allocations.
>>>
>>>> To avoid it, we make some tasks binds to these cpu-less NUMA Nodes to use these memory.
>>>> When these tasks trigger OOM, tasks that don't use these cpu-less NUMA Nodes may be killed
>>>> according to rss.Even worse, after one task is killed, the allocating task find there is
>>>> still no memory, triggers OOM again and kills another wrong task.
>>> Let's see whether I follow you here. So you are binding some tasks to movable
>>> nodes only and if their allocation fails you want to kill that task
>>> rather than invoking mempolicy OOM killer as that could kill tasks
>>> which are not constrained to movable nodes, right?
>> Yes. It't difficult to kill tasks that use movable nodes memory, because we have
>> no information of per-numa rss of each task. So, kill current task is the simplest way
>> to avoid killing wrongly.
> There were attempts to make the oom killer cpuset aware. This would
> allow to constrain the oom killer to a cpuset for which we cannot
> satisfy the allocation for. I do not remember details why this reach
> meargable state. Have you considered something like that as an option?
Only selecting tasks that bind to one of these movable nodes seems better.
Although the oom killer could only select according to the task mempolicy, not the vma policy, it's better
than blindly killing current.
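To make the task-policy vs. vma-policy distinction concrete, a small sketch
(node 2 is again hypothetical; set_mempolicy() and mbind() are libnuma/<numaif.h>
calls): set_mempolicy() sets the task-wide policy stored in tsk->mempolicy,
which task-based selection can inspect, while mbind() attaches a policy only to
one mapping.

#include <numaif.h>     /* set_mempolicy(), mbind(), MPOL_BIND; link with -lnuma */
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
        unsigned long nodemask = 1UL << 2;      /* hypothetical movable node */
        size_t sz = 16UL << 20;
        void *p;

        /* Task policy: stored in tsk->mempolicy, so task-based OOM victim
         * selection (mempolicy_in_oom_domain()) can see it. */
        if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)))
                return EXIT_FAILURE;

        /* VMA policy: recorded only in the mapping itself, invisible to
         * task-based selection. */
        p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return EXIT_FAILURE;
        if (mbind(p, sz, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
                return EXIT_FAILURE;

        munmap(p, sz);
        return EXIT_SUCCESS;
}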
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-05 9:25 ` Jinjiang Tu
@ 2025-09-05 9:42 ` Michal Hocko
2025-09-06 1:56 ` Jinjiang Tu
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-09-05 9:42 UTC (permalink / raw)
To: Jinjiang Tu
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Fri 05-09-25 17:25:44, Jinjiang Tu wrote:
>
> On 2025/9/5 17:10, Michal Hocko wrote:
> > On Fri 05-09-25 16:18:43, Jinjiang Tu wrote:
> > > On 2025/9/5 16:08, Michal Hocko wrote:
> > > > On Fri 05-09-25 09:56:03, Jinjiang Tu wrote:
> > > > > On 2025/9/4 22:25, Michal Hocko wrote:
> > > > > > On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
> > > > > > > out_of_memory() selects tasks without considering mempolicy. Assuming a
> > > > > > > cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> > > > > > > allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> > > > > > > low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> > > > > > > tasks may be killed wrongly that don't occupy memory from this Node.
> > > > > > I can see how a miconfigured task that binds _only_ to memoryless nodes
> > > > > > should be killed but this is not what the patch does, right? Could you
> > > > > > tell us more about the specific situation?
> > > > > We have some cpu-less NUMA Nodes, the memory are hotpluged in, and the zone
> > > > > is configured as ZONE_MOVABLE to guarantee these used memory can be migrated when
> > > > > we want to offline the NUMA Node.
> > > > >
> > > > > Generally tasks doesn't configure any mempolicy and use the default mempolicy, i.e.
> > > > > allocate from NUMA Node where the task is running on, and fallback to other NUMA Nodes
> > > > > when the local NUMA Node is below low watermark.As a result, these cpu-less NUMA Nodes
> > > > > won't be allocated until the NUMA Nodes with cpus are with low memory. However, These
> > > > > cpu-less NUMA Nodes are configured as ZONE_MOVABLE, can't be used by kernel allocation,
> > > > > leading to OOM with large amount of MOVABLE memory.
> > > > Right, this is a fundamental constrain of movable zones. They cannot
> > > > satisfy non-movable allocations and you can get OOM for those requests
> > > > even if there is plenty of movable memory available. This is no
> > > > different from highmem systems and kernel allocations.
> > > >
> > > > > To avoid it, we make some tasks binds to these cpu-less NUMA Nodes to use these memory.
> > > > > When these tasks trigger OOM, tasks that don't use these cpu-less NUMA Nodes may be killed
> > > > > according to rss.Even worse, after one task is killed, the allocating task find there is
> > > > > still no memory, triggers OOM again and kills another wrong task.
> > > > Let's see whether I follow you here. So you are binding some tasks to movable
> > > > nodes only and if their allocation fails you want to kill that task
> > > > rather than invoking mempolicy OOM killer as that could kill tasks
> > > > which are not constrained to movable nodes, right?
> > > Yes. It't difficult to kill tasks that use movable nodes memory, because we have
> > > no information of per-numa rss of each task. So, kill current task is the simplest way
> > > to avoid killing wrongly.
> > There were attempts to make the oom killer cpuset aware. This would
> > allow to constrain the oom killer to a cpuset for which we cannot
> > satisfy the allocation for. I do not remember details why this reach
> > meargable state. Have you considered something like that as an option?
>
> Only select tasks that bind to one of these movable nodes, it seems better.
>
> Although oom killer could only select according to task mempolicy, not vma policy, it't better
> than blindly killing current.
Yes, I do not think we can ever support full mempolicy capabilities, but
recognizing this as a cpuset allocation failure and selecting from the
cpuset's tasks makes a lot of sense.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-05 9:42 ` Michal Hocko
@ 2025-09-06 1:56 ` Jinjiang Tu
2025-09-08 7:46 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-06 1:56 UTC (permalink / raw)
To: Michal Hocko
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On 2025/9/5 17:42, Michal Hocko wrote:
> On Fri 05-09-25 17:25:44, Jinjiang Tu wrote:
>> On 2025/9/5 17:10, Michal Hocko wrote:
>>> On Fri 05-09-25 16:18:43, Jinjiang Tu wrote:
>>>> On 2025/9/5 16:08, Michal Hocko wrote:
>>>>> On Fri 05-09-25 09:56:03, Jinjiang Tu wrote:
>>>>>> On 2025/9/4 22:25, Michal Hocko wrote:
>>>>>>> On Thu 04-09-25 21:44:31, Jinjiang Tu wrote:
>>>>>>>> out_of_memory() selects tasks without considering mempolicy. Assuming a
>>>>>>>> cpu-less NUMA Node, ordinary process that don't set mempolicy don't
>>>>>>>> allocate memory from this cpu-less Node, unless other NUMA Nodes are below
>>>>>>>> low watermark. If a task binds to this cpu-less Node and triggers OOM, many
>>>>>>>> tasks may be killed wrongly that don't occupy memory from this Node.
>>>>>>> I can see how a miconfigured task that binds _only_ to memoryless nodes
>>>>>>> should be killed but this is not what the patch does, right? Could you
>>>>>>> tell us more about the specific situation?
>>>>>> We have some cpu-less NUMA Nodes, the memory are hotpluged in, and the zone
>>>>>> is configured as ZONE_MOVABLE to guarantee these used memory can be migrated when
>>>>>> we want to offline the NUMA Node.
>>>>>>
>>>>>> Generally tasks doesn't configure any mempolicy and use the default mempolicy, i.e.
>>>>>> allocate from NUMA Node where the task is running on, and fallback to other NUMA Nodes
>>>>>> when the local NUMA Node is below low watermark.As a result, these cpu-less NUMA Nodes
>>>>>> won't be allocated until the NUMA Nodes with cpus are with low memory. However, These
>>>>>> cpu-less NUMA Nodes are configured as ZONE_MOVABLE, can't be used by kernel allocation,
>>>>>> leading to OOM with large amount of MOVABLE memory.
>>>>> Right, this is a fundamental constrain of movable zones. They cannot
>>>>> satisfy non-movable allocations and you can get OOM for those requests
>>>>> even if there is plenty of movable memory available. This is no
>>>>> different from highmem systems and kernel allocations.
>>>>>
>>>>>> To avoid it, we make some tasks binds to these cpu-less NUMA Nodes to use these memory.
>>>>>> When these tasks trigger OOM, tasks that don't use these cpu-less NUMA Nodes may be killed
>>>>>> according to rss.Even worse, after one task is killed, the allocating task find there is
>>>>>> still no memory, triggers OOM again and kills another wrong task.
>>>>> Let's see whether I follow you here. So you are binding some tasks to movable
>>>>> nodes only and if their allocation fails you want to kill that task
>>>>> rather than invoking mempolicy OOM killer as that could kill tasks
>>>>> which are not constrained to movable nodes, right?
>>>> Yes. It't difficult to kill tasks that use movable nodes memory, because we have
>>>> no information of per-numa rss of each task. So, kill current task is the simplest way
>>>> to avoid killing wrongly.
>>> There were attempts to make the oom killer cpuset aware. This would
>>> allow to constrain the oom killer to a cpuset for which we cannot
>>> satisfy the allocation for. I do not remember details why this reach
>>> meargable state. Have you considered something like that as an option?
>> Only select tasks that bind to one of these movable nodes, it seems better.
>>
>> Although oom killer could only select according to task mempolicy, not vma policy, it't better
>> than blindly killing current.
> Yes, I do not think we can ever support full mempolicy capabilities but
> recognizing this is a cpuset allocation failure and selecting from the
> cpuset tasks makes a lot of sense.
In our use case, the movable nodes are in all cpusets, so that the movable nodes can be
used by all tasks. Even if we move tasks into cpusets that only allow allocating
from the movable nodes, oom_cpuset_eligible()->cpuset_mems_allowed_intersects() returns true for
all tasks.
Maybe when oc->nodemask == movable nodes, we should only select tasks whose mempolicy intersects with oc->nodemask,
like the following:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eb83cff7db8c..e56b6de836a6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
         if (!mask)
                 return ret;
 
+        if (!nodes_intersects(*mask, node_states[N_CPU]))
+                ret = false;
+
         task_lock(tsk);
         mempolicy = tsk->mempolicy;
         if (mempolicy && mempolicy->mode == MPOL_BIND)
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-06 1:56 ` Jinjiang Tu
@ 2025-09-08 7:46 ` Michal Hocko
2025-09-08 8:16 ` Jinjiang Tu
0 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2025-09-08 7:46 UTC (permalink / raw)
To: Jinjiang Tu
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Sat 06-09-25 09:56:16, Jinjiang Tu wrote:
> In our use case, movable nodes are in all cpusets, so that movable nodes can be
> used by all tasks. Even though we move tasks into cpusets that only allow to allocate
> from movable nodes, oom_cpuset_eligible()->cpuset_mems_allowed_intersects() returns true for
> all tasks.
Right, but this is because you allowed _all_ tasks to allocate from those
movable nodes, so why would that be unexpected behavior?
> Maybe when oc->nodemask == movable nodes, only select tasks whose mempolicy intersects with oc->nodemask.
> Like the following:
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index eb83cff7db8c..e56b6de836a6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
> if (!mask)
> return ret;
> + if (!nodes_intersects(*oc->nodemask, node_states[N_CPU]))
> + ret = false;
> +
Nope, this doesn't really make much sense TBH. I believe you should stop
special-casing cpuless nodes, look into the actual configuration, and
check how to make cpuset-based OOM task selection work. Your underlying
problem is not that no CPUs are assigned to a numa node but an allocation
constraint based on the movability of allocations, so you need to find a
solution that deals with that constraint.
> task_lock(tsk);
> mempolicy = tsk->mempolicy;
> if (mempolicy && mempolicy->mode == MPOL_BIND)
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-08 7:46 ` Michal Hocko
@ 2025-09-08 8:16 ` Jinjiang Tu
2025-09-08 9:11 ` Michal Hocko
0 siblings, 1 reply; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-08 8:16 UTC (permalink / raw)
To: Michal Hocko
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On 2025/9/8 15:46, Michal Hocko wrote:
> On Sat 06-09-25 09:56:16, Jinjiang Tu wrote:
>> In our use case, movable nodes are in all cpusets, so that movable nodes can be
>> used by all tasks. Even though we move tasks into cpusets that only allow to allocate
>> from movable nodes, oom_cpuset_eligible()->cpuset_mems_allowed_intersects() returns true for
>> all tasks.
> Right but this is because you allowed _all_ tasks to allocate from those
> movable nodes so why would that be an unexpected behavior?
>
>> Maybe when oc->nodemask == movable nodes, only select tasks whose mempolicy intersects with oc->nodemask.
>> Like the following:
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index eb83cff7db8c..e56b6de836a6 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
>> if (!mask)
>> return ret;
>> + if (!nodes_intersects(*oc->nodemask, node_states[N_CPU]))
>> + ret = false;
>> +
> Nope, this doesn't really make much sense TBH. I believe you should stop
> special casing cpuless nodes and look into the actual configuration and
> check how to make cpuset based OOM tasks selection. Your underlying
> problem is not about no CPUs assigned to a numa node but an allocation
> constrain based on movability of allocations so you need to find a
> solution that is dealing with that constrain.
Many tasks are in the root cpuset, systemd for example. The root cpuset
contains all nodes; we couldn't exclude the cpu-less nodes.
If we rely on cpuset-based OOM task selection, tasks in the root cpuset may
still be selected.
>
>> task_lock(tsk);
>> mempolicy = tsk->mempolicy;
>> if (mempolicy && mempolicy->mode == MPOL_BIND)
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-08 8:16 ` Jinjiang Tu
@ 2025-09-08 9:11 ` Michal Hocko
2025-09-08 11:07 ` Jinjiang Tu
2025-09-08 11:13 ` Jinjiang Tu
0 siblings, 2 replies; 21+ messages in thread
From: Michal Hocko @ 2025-09-08 9:11 UTC (permalink / raw)
To: Jinjiang Tu
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Mon 08-09-25 16:16:38, Jinjiang Tu wrote:
>
> On 2025/9/8 15:46, Michal Hocko wrote:
> > On Sat 06-09-25 09:56:16, Jinjiang Tu wrote:
> > > In our use case, movable nodes are in all cpusets, so that movable nodes can be
> > > used by all tasks. Even though we move tasks into cpusets that only allow to allocate
> > > from movable nodes, oom_cpuset_eligible()->cpuset_mems_allowed_intersects() returns true for
> > > all tasks.
> > Right but this is because you allowed _all_ tasks to allocate from those
> > movable nodes so why would that be an unexpected behavior?
> >
> > > Maybe when oc->nodemask == movable nodes, only select tasks whose mempolicy intersects with oc->nodemask.
> > > Like the following:
> > >
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index eb83cff7db8c..e56b6de836a6 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
> > > if (!mask)
> > > return ret;
> > > + if (!nodes_intersects(*oc->nodemask, node_states[N_CPU]))
> > > + ret = false;
> > > +
> > Nope, this doesn't really make much sense TBH. I believe you should stop
> > special casing cpuless nodes and look into the actual configuration and
> > check how to make cpuset based OOM tasks selection. Your underlying
> > problem is not about no CPUs assigned to a numa node but an allocation
> > constrain based on movability of allocations so you need to find a
> > solution that is dealing with that constrain.
>
> Many tasks are in the root cpuset, systemd for example. The root cpuset
> contains all nodes, we couldn't exclude cpu-less nodes.
>
> If we reply on cpuset based OOM tasks selection, tasks in root cpuset may
> still be selected.
If you start by killing tasks from the cpuset of the currently
allocating task then this shouldn't really happen, right?
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-08 9:11 ` Michal Hocko
@ 2025-09-08 11:07 ` Jinjiang Tu
2025-09-08 11:13 ` Jinjiang Tu
1 sibling, 0 replies; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-08 11:07 UTC (permalink / raw)
To: Michal Hocko
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On 2025/9/8 17:11, Michal Hocko wrote:
> On Mon 08-09-25 16:16:38, Jinjiang Tu wrote:
>> On 2025/9/8 15:46, Michal Hocko wrote:
>>> On Sat 06-09-25 09:56:16, Jinjiang Tu wrote:
>>>> In our use case, movable nodes are in all cpusets, so that movable nodes can be
>>>> used by all tasks. Even though we move tasks into cpusets that only allow to allocate
>>>> from movable nodes, oom_cpuset_eligible()->cpuset_mems_allowed_intersects() returns true for
>>>> all tasks.
>>> Right but this is because you allowed _all_ tasks to allocate from those
>>> movable nodes so why would that be an unexpected behavior?
>>>
>>>> Maybe when oc->nodemask == movable nodes, only select tasks whose mempolicy intersects with oc->nodemask.
>>>> Like the following:
>>>>
>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>> index eb83cff7db8c..e56b6de836a6 100644
>>>> --- a/mm/mempolicy.c
>>>> +++ b/mm/mempolicy.c
>>>> @@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
>>>> if (!mask)
>>>> return ret;
>>>> + if (!nodes_intersects(*oc->nodemask, node_states[N_CPU]))
>>>> + ret = false;
>>>> +
>>> Nope, this doesn't really make much sense TBH. I believe you should stop
>>> special casing cpuless nodes and look into the actual configuration and
>>> check how to make cpuset based OOM tasks selection. Your underlying
>>> problem is not about no CPUs assigned to a numa node but an allocation
>>> constrain based on movability of allocations so you need to find a
>>> solution that is dealing with that constrain.
>> Many tasks are in the root cpuset, systemd for example. The root cpuset
>> contains all nodes, we couldn't exclude cpu-less nodes.
>>
>> If we reply on cpuset based OOM tasks selection, tasks in root cpuset may
>> still be selected.
> If you start by killing tasks from the cpuset of the currently
> allocating task then this shouldn't really happen, right?
Yes, indeed.
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-08 9:11 ` Michal Hocko
2025-09-08 11:07 ` Jinjiang Tu
@ 2025-09-08 11:13 ` Jinjiang Tu
2025-09-08 11:26 ` Michal Hocko
1 sibling, 1 reply; 21+ messages in thread
From: Jinjiang Tu @ 2025-09-08 11:13 UTC (permalink / raw)
To: Michal Hocko
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On 2025/9/8 17:11, Michal Hocko wrote:
> On Mon 08-09-25 16:16:38, Jinjiang Tu wrote:
>> On 2025/9/8 15:46, Michal Hocko wrote:
>>> On Sat 06-09-25 09:56:16, Jinjiang Tu wrote:
>>>> In our use case, movable nodes are in all cpusets, so that movable nodes can be
>>>> used by all tasks. Even though we move tasks into cpusets that only allow to allocate
>>>> from movable nodes, oom_cpuset_eligible()->cpuset_mems_allowed_intersects() returns true for
>>>> all tasks.
>>> Right but this is because you allowed _all_ tasks to allocate from those
>>> movable nodes so why would that be an unexpected behavior?
>>>
>>>> Maybe when oc->nodemask == movable nodes, only select tasks whose mempolicy intersects with oc->nodemask.
>>>> Like the following:
>>>>
>>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>>> index eb83cff7db8c..e56b6de836a6 100644
>>>> --- a/mm/mempolicy.c
>>>> +++ b/mm/mempolicy.c
>>>> @@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
>>>> if (!mask)
>>>> return ret;
>>>> + if (!nodes_intersects(*oc->nodemask, node_states[N_CPU]))
>>>> + ret = false;
>>>> +
>>> Nope, this doesn't really make much sense TBH. I believe you should stop
>>> special casing cpuless nodes and look into the actual configuration and
>>> check how to make cpuset based OOM tasks selection. Your underlying
>>> problem is not about no CPUs assigned to a numa node but an allocation
>>> constrain based on movability of allocations so you need to find a
>>> solution that is dealing with that constrain.
>> Many tasks are in the root cpuset, systemd for example. The root cpuset
>> contains all nodes, we couldn't exclude cpu-less nodes.
>>
>> If we reply on cpuset based OOM tasks selection, tasks in root cpuset may
>> still be selected.
> If you start by killing tasks from the cpuset of the currently
> allocating task then this shouldn't really happen, right?
Do you mean we should put the tasks into the same cpuset and then limit the max usage
of the memcg, so that only memcg OOM is triggered and victims are selected from the same memcg?
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-08 11:13 ` Jinjiang Tu
@ 2025-09-08 11:26 ` Michal Hocko
0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2025-09-08 11:26 UTC (permalink / raw)
To: Jinjiang Tu
Cc: rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, gourry, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Mon 08-09-25 19:13:52, Jinjiang Tu wrote:
>
> On 2025/9/8 17:11, Michal Hocko wrote:
> > On Mon 08-09-25 16:16:38, Jinjiang Tu wrote:
> > > On 2025/9/8 15:46, Michal Hocko wrote:
> > > > On Sat 06-09-25 09:56:16, Jinjiang Tu wrote:
> > > > > In our use case, movable nodes are in all cpusets, so that movable nodes can be
> > > > > used by all tasks. Even though we move tasks into cpusets that only allow to allocate
> > > > > from movable nodes, oom_cpuset_eligible()->cpuset_mems_allowed_intersects() returns true for
> > > > > all tasks.
> > > > Right but this is because you allowed _all_ tasks to allocate from those
> > > > movable nodes so why would that be an unexpected behavior?
> > > >
> > > > > Maybe when oc->nodemask == movable nodes, only select tasks whose mempolicy intersects with oc->nodemask.
> > > > > Like the following:
> > > > >
> > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > > index eb83cff7db8c..e56b6de836a6 100644
> > > > > --- a/mm/mempolicy.c
> > > > > +++ b/mm/mempolicy.c
> > > > > @@ -2328,6 +2328,9 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk,
> > > > > if (!mask)
> > > > > return ret;
> > > > > + if (!nodes_intersects(*oc->nodemask, node_states[N_CPU]))
> > > > > + ret = false;
> > > > > +
> > > > Nope, this doesn't really make much sense TBH. I believe you should stop
> > > > special casing cpuless nodes and look into the actual configuration and
> > > > check how to make cpuset based OOM tasks selection. Your underlying
> > > > problem is not about no CPUs assigned to a numa node but an allocation
> > > > constrain based on movability of allocations so you need to find a
> > > > solution that is dealing with that constrain.
> > > Many tasks are in the root cpuset, systemd for example. The root cpuset
> > > contains all nodes, we couldn't exclude cpu-less nodes.
> > >
> > > If we reply on cpuset based OOM tasks selection, tasks in root cpuset may
> > > still be selected.
> > If you start by killing tasks from the cpuset of the currently
> > allocating task then this shouldn't really happen, right?
>
> Do you mean we should put the tasks into the same cpuset, and then limit the max usage
> of the memcg, make it only trigger memcg OOM, to select tasks from the same memcg?
No, I mean that you should partition your system by cpusets, and if there
is a mempolicy OOM situation then you select the oom victim from the cpuset
the current task is allocating from. You can employ the memcg cgroup
controller as well, but that is an orthogonal thing.
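As an illustration of such partitioning, a minimal sketch assuming cgroup v2 is
mounted at /sys/fs/cgroup with the cpuset controller enabled and a pre-created
"movable" group (node 2 again stands in for the cpu-less movable node):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a value to a cgroup control file (e.g. cpuset.mems). */
static int cg_write(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);
        ssize_t n;

        if (fd < 0)
                return -1;
        n = write(fd, val, strlen(val));
        close(fd);
        return n < 0 ? -1 : 0;
}

int main(void)
{
        char pid[16];

        /* Restrict the pre-created "movable" partition to node 2 (the
         * hypothetical cpu-less movable node). */
        if (cg_write("/sys/fs/cgroup/movable/cpuset.mems", "2"))
                return 1;

        /* Move the current task into that partition so an OOM hit in this
         * constrained context can be attributed to the cpuset. */
        snprintf(pid, sizeof(pid), "%d", getpid());
        return cg_write("/sys/fs/cgroup/movable/cgroup.procs", pid) ? 1 : 0;
}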
--
Michal Hocko
SUSE Labs
* Re: [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes
2025-09-04 13:44 [PATCH] mm/oom_kill: kill current in OOM when binding to cpu-less nodes Jinjiang Tu
2025-09-04 14:25 ` Michal Hocko
2025-09-04 14:26 ` Joshua Hahn
@ 2025-09-08 17:50 ` Gregory Price
2 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2025-09-08 17:50 UTC (permalink / raw)
To: Jinjiang Tu
Cc: mhocko, rientjes, shakeel.butt, akpm, david, ziy, matthew.brost,
joshua.hahnjy, rakie.kim, byungchul, ying.huang, apopple,
linux-mm, wangkefeng.wang
On Thu, Sep 04, 2025 at 09:44:31PM +0800, Jinjiang Tu wrote:
> out_of_memory() selects tasks without considering mempolicy. Assuming a
> cpu-less NUMA Node, ordinary process that don't set mempolicy don't
> allocate memory from this cpu-less Node, unless other NUMA Nodes are below
> low watermark. If a task binds to this cpu-less Node and triggers OOM, many
> tasks may be killed wrongly that don't occupy memory from this Node.
>
I don't think mempolicy is the right source of information for this, as
mempolicy is non-restrictive. Mempolicy is not necessarily respected by
reclaim, for example.
A task without a mempolicy can still end up using cpu-less nodes for a
number of reasons - for example pro-active reclaim, shared pagecache,
shared file mappings, KSM, etc.
If mempolicy were restrictive by default, this would be a different
story, but I have to agree with Michal that I don't think mempolicy is
the right source of information for this. It seems much more appropriate
to use cpusets to inform oom_kill.
> To fix it, only kill current if oc->nodemask are all nodes without any cpu.
>
This feels very heuristic-y and way too narrow of a use case. My gut
reaction is that there must be a better way to get what you're looking for.
~Gregory