* [PATCH] memcg: avoid dead loop when setting memory.max
From: Chen Ridong @ 2025-02-11 8:18 UTC
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm
Cc: cgroups, linux-mm, linux-kernel, chenridong, wangweiyang2
From: Chen Ridong <chenridong@huawei.com>
A softlockup issue was found with a stress test:
watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
RIP: 0010:stop_machine_yield+0x2/0x10
RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
FS: 0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
multi_cpu_stop+0x8f/0x100
cpu_stopper_thread+0x90/0x140
smpboot_thread_fn+0xad/0x150
kthread+0xc2/0x100
ret_from_fork+0x2d/0x50
The stress test involves CPU hotplug operations and memory control group
(memcg) operations. The scenario can be described as follows:
echo xx > memory.max        cache_ap_online                  oom_reaper
(CPU23)                     (CPU50)
xx < usage                  stop_machine_from_inactive_cpu
for(;;)                     // all active cpus
    trigger OOM             queue_stop_cpus_work
    // waiting oom_reaper
                            multi_cpu_stop(migration/xx)
                            // sync all active cpus ack
                            // waiting cpu23 ack
                                                             // CPU50 loops in multi_cpu_stop
                                                             waiting cpu50
Detailed explanation:
1. When the usage is larger than xx, an OOM may be triggered. If the
   process does not handle the kill signal immediately, it will loop
   in memory_max_write.
2. When cache_ap_online is triggered, multi_cpu_stop is queued to the
   active cpus. Within the multi_cpu_stop function, it attempts to
   synchronize the CPU states. However, CPU23 does not acknowledge
   because it is stuck in the for(;;) loop.
3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
   for CPU23 to acknowledge the synchronization request.
4. Finally, this forms a cyclic dependency and leads to a softlockup and
   dead loop.

To fix this issue, add cond_resched() in memory_max_write, so that
it will not block the migration task.
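
For context, the hunk below sits at the end of the shrink loop in
memory_max_write(). A simplified sketch of that loop (argument lists
abbreviated and details elided, so a sketch rather than the verbatim
code):

    for (;;) {
        unsigned long nr_pages = page_counter_read(&memcg->memory);

        if (nr_pages <= max)
            break;
        if (signal_pending(current))
            break;
        if (!drained) {
            drain_all_stock(memcg);     /* flush per-CPU charge caches */
            drained = true;
            continue;
        }
        if (nr_reclaims) {
            /* up to MAX_RECLAIM_RETRIES reclaim rounds, which have
             * their own scheduling points */
            if (!try_to_free_mem_cgroup_pages(memcg, ...))
                nr_reclaims--;
            continue;
        }
        /* reclaim exhausted: OOM-kill until the usage fits the new max */
        memcg_memory_event(memcg, MEMCG_OOM);
        if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
            break;
        cond_resched();     /* this patch: yield so migration/xx can run */
    }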
Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
mm/memcontrol.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8d21c1a44220..16f3bdbd37d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
memcg_memory_event(memcg, MEMCG_OOM);
if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
break;
+ cond_resched();
}
memcg_wb_domain_size_changed(memcg);
--
2.34.1
* Re: [PATCH] memcg: avoid dead loop when setting memory.max
From: Michal Hocko @ 2025-02-11 9:02 UTC
To: Chen Ridong
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, cgroups,
linux-mm, linux-kernel, chenridong, wangweiyang2
On Tue 11-02-25 08:18:19, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A softlockup issue was found with a stress test:
> watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
> CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
> Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
> RIP: 0010:stop_machine_yield+0x2/0x10
> RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
> RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
> RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
> RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
> R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
> R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
> FS: 0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> multi_cpu_stop+0x8f/0x100
> cpu_stopper_thread+0x90/0x140
> smpboot_thread_fn+0xad/0x150
> kthread+0xc2/0x100
> ret_from_fork+0x2d/0x50
>
> The stress test involves CPU hotplug operations and memory control group
> (memcg) operations. The scenario can be described as follows:
>
> echo xx > memory.max        cache_ap_online                  oom_reaper
> (CPU23)                     (CPU50)
> xx < usage                  stop_machine_from_inactive_cpu
> for(;;)                     // all active cpus
>     trigger OOM             queue_stop_cpus_work
>     // waiting oom_reaper
>                             multi_cpu_stop(migration/xx)
>                             // sync all active cpus ack
>                             // waiting cpu23 ack
>                                                              // CPU50 loops in multi_cpu_stop
>                                                              waiting cpu50
>
> Detailed explanation:
> 1. When the usage is larger than xx, an OOM may be triggered. If the
>    process does not handle the kill signal immediately, it will loop
>    in memory_max_write.
Do I get it right that the issue is that mem_cgroup_out_of_memory
doesn't have any cond_resched, so it cannot yield to the stopper
kthread? The OOM itself cannot make any progress because the OOM victim
is blocked as per 3).
> 2. When cache_ap_online is triggered, multi_cpu_stop is queued to the
>    active cpus. Within the multi_cpu_stop function, it attempts to
>    synchronize the CPU states. However, CPU23 does not acknowledge
>    because it is stuck in the for(;;) loop.
> 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
>    for CPU23 to acknowledge the synchronization request.
> 4. Finally, this forms a cyclic dependency and leads to a softlockup and
>    dead loop.
>
> To fix this issue, add cond_resched() in memory_max_write, so that
> it will not block the migration task.
My first question was why this is not a problem in other
allocation/charge paths, but this one is different because it never
tries to reclaim again after MAX_RECLAIM_RETRIES reclaim rounds.
We do have scheduling points in the reclaim path, but they are no
longer triggered once we hit the OOM situation in this case.

I was thinking about having a guaranteed cond_resched when the OOM
killer fails to find a victim, but it seems the simplest fix for this
particular corner case is to add cond_resched as you did here.
Hopefully we will get rid of it very soon when !PREEMPT is removed.

Btw. this could be a problem on a single-CPU machine even without CPU
hotplug, as the oom_reaper won't run until memory_max_write yields the
CPU.
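
For reference, the scheduling points mentioned above sit in the reclaim
call chain itself, roughly (abbreviated sketch, not the verbatim code):

    try_to_free_mem_cgroup_pages()
      do_try_to_free_pages()
        shrink_node()
          ...
            shrink_lruvec()
              cond_resched()    /* scheduling point per reclaim batch */

Once nr_reclaims is exhausted, memory_max_write only calls
mem_cgroup_out_of_memory, whose failing path has no equivalent yield.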
> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> mm/memcontrol.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8d21c1a44220..16f3bdbd37d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> memcg_memory_event(memcg, MEMCG_OOM);
> if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
> break;
> + cond_resched();
> }
>
> memcg_wb_domain_size_changed(memcg);
> --
> 2.34.1
--
Michal Hocko
SUSE Labs
* Re: [PATCH] memcg: avoid dead loop when setting memory.max
From: Chen Ridong @ 2025-02-11 11:29 UTC
To: Michal Hocko
Cc: hannes, roman.gushchin, shakeel.butt, muchun.song, akpm, cgroups,
linux-mm, linux-kernel, chenridong, wangweiyang2
On 2025/2/11 17:02, Michal Hocko wrote:
> On Tue 11-02-25 08:18:19, Chen Ridong wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> A softlockup issue was found with a stress test:
>> watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
>> CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
>> Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
>> RIP: 0010:stop_machine_yield+0x2/0x10
>> RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
>> RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
>> RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
>> RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
>> R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
>> R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
>> FS: 0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> PKRU: 55555554
>> Call Trace:
>> multi_cpu_stop+0x8f/0x100
>> cpu_stopper_thread+0x90/0x140
>> smpboot_thread_fn+0xad/0x150
>> kthread+0xc2/0x100
>> ret_from_fork+0x2d/0x50
>>
>> The stress test involves CPU hotplug operations and memory control group
>> (memcg) operations. The scenario can be described as follows:
>>
>> echo xx > memory.max        cache_ap_online                  oom_reaper
>> (CPU23)                     (CPU50)
>> xx < usage                  stop_machine_from_inactive_cpu
>> for(;;)                     // all active cpus
>>     trigger OOM             queue_stop_cpus_work
>>     // waiting oom_reaper
>>                             multi_cpu_stop(migration/xx)
>>                             // sync all active cpus ack
>>                             // waiting cpu23 ack
>>                                                              // CPU50 loops in multi_cpu_stop
>>                                                              waiting cpu50
>>
>> Detailed explanation:
>> 1. When the usage is larger than xx, an OOM may be triggered. If the
>>    process does not handle the kill signal immediately, it will loop
>>    in memory_max_write.
>
> Do I get it right that the issue is that mem_cgroup_out_of_memory
> doesn't have any cond_resched, so it cannot yield to the stopper
> kthread? The OOM itself cannot make any progress because the OOM victim
> is blocked as per 3).
>
Yes, the same task was selected as the victim again, and it is blocked
as described in point 3). Consequently, oom_evaluate_task() set
oc->chosen = (void *)-1UL, and no cond_resched() was ever invoked:

for(;;) {
    ...
    mem_cgroup_out_of_memory
      out_of_memory
        select_bad_process
          oom_evaluate_task
            oc->chosen = (void *)-1UL;
        return !!oc->chosen;
}
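
The abort branch referenced above is in oom_evaluate_task()
(mm/oom_kill.c): as long as the previously chosen victim has not been
reaped (MMF_OOM_SKIP still clear), victim selection backs off. Roughly,
as a sketch rather than the verbatim code:

    if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
        if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
            goto next;      /* already reaped, pick another victim */
        goto abort;         /* still waiting for oom_reaper */
    }
    ...
abort:
    oc->chosen = (void *)-1UL;  /* out_of_memory() then returns true */
    return 1;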
>> 2. When cache_ap_online is triggered, multi_cpu_stop is queued to the
>>    active cpus. Within the multi_cpu_stop function, it attempts to
>>    synchronize the CPU states. However, CPU23 does not acknowledge
>>    because it is stuck in the for(;;) loop.
>> 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
>>    for CPU23 to acknowledge the synchronization request.
>> 4. Finally, this forms a cyclic dependency and leads to a softlockup and
>>    dead loop.
>>
>> To fix this issue, add cond_resched() in memory_max_write, so that
>> it will not block the migration task.
>
> My first question was why this is not a problem in other
> allocation/charge paths, but this one is different because it never
> tries to reclaim again after MAX_RECLAIM_RETRIES reclaim rounds.
> We do have scheduling points in the reclaim path, but they are no
> longer triggered once we hit the OOM situation in this case.
>
> I was thinking about having a guaranteed cond_resched when the OOM
> killer fails to find a victim, but it seems the simplest fix for this
> particular corner case is to add cond_resched as you did here.
> Hopefully we will get rid of it very soon when !PREEMPT is removed.
>
> Btw. this could be a problem on a single-CPU machine even without CPU
> hotplug, as the oom_reaper won't run until memory_max_write yields the
> CPU.
>
>> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
Thank you very much.
Best regards,
Ridong
>> ---
>> mm/memcontrol.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 8d21c1a44220..16f3bdbd37d8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>> memcg_memory_event(memcg, MEMCG_OOM);
>> if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
>> break;
>> + cond_resched();
>> }
>>
>> memcg_wb_domain_size_changed(memcg);
>> --
>> 2.34.1
>
* Re: [PATCH] memcg: avoid dead loop when setting memory.max
From: Shakeel Butt @ 2025-02-11 19:04 UTC
To: Chen Ridong
Cc: hannes, mhocko, roman.gushchin, muchun.song, akpm, cgroups,
linux-mm, linux-kernel, chenridong, wangweiyang2
On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> A softlockup issue was found with a stress test:
> watchdog: BUG: soft lockup - CPU#27 stuck for 26s! [migration/27:181]
> CPU: 27 UID: 0 PID: 181 Comm: migration/27 6.14.0-rc2-next-20250210 #1
> Stopper: multi_cpu_stop <- stop_machine_from_inactive_cpu
> RIP: 0010:stop_machine_yield+0x2/0x10
> RSP: 0000:ff4a0dcecd19be48 EFLAGS: 00000246
> RAX: ffffffff89c0108f RBX: ff4a0dcec03afe44 RCX: 0000000000000000
> RDX: ff1cdaaf6eba5808 RSI: 0000000000000282 RDI: ff1cda80c1775a40
> RBP: 0000000000000001 R08: 00000011620096c6 R09: 7fffffffffffffff
> R10: 0000000000000001 R11: 0000000000000100 R12: ff1cda80c1775a40
> R13: 0000000000000000 R14: 0000000000000001 R15: ff4a0dcec03afe20
> FS: 0000000000000000(0000) GS:ff1cdaaf6eb80000(0000)
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 00000025e2c2a001 CR4: 0000000000773ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> multi_cpu_stop+0x8f/0x100
> cpu_stopper_thread+0x90/0x140
> smpboot_thread_fn+0xad/0x150
> kthread+0xc2/0x100
> ret_from_fork+0x2d/0x50
>
> The stress test involves CPU hotplug operations and memory control group
> (memcg) operations. The scenario can be described as follows:
>
> echo xx > memory.max        cache_ap_online                  oom_reaper
> (CPU23)                     (CPU50)
> xx < usage                  stop_machine_from_inactive_cpu
> for(;;)                     // all active cpus
>     trigger OOM             queue_stop_cpus_work
>     // waiting oom_reaper
>                             multi_cpu_stop(migration/xx)
>                             // sync all active cpus ack
>                             // waiting cpu23 ack
>                                                              // CPU50 loops in multi_cpu_stop
>                                                              waiting cpu50
>
> Detailed explanation:
> 1. When the usage is larger than xx, an OOM may be triggered. If the
> process does not handle with ths kill signal immediately, it will loop
> in the memory_max_write.
> 2. When cache_ap_online is triggered, the multi_cpu_stop is queued to the
> active cpus. Within the multi_cpu_stop function, it attempts to
> synchronize the CPU states. However, the CPU23 didn't acknowledge
> because it is stuck in a loop within the for(;;).
> 3. The oom_reaper process is blocked because CPU50 is in a loop, waiting
> for CPU23 to acknowledge the synchronization request.
> 4. Finally, it formed cyclic dependency and lead to softlockup and dead
> loop.
>
> To fix this issue, add cond_resched() in memory_max_write, so that
> it will not block the migration task.
>
> Fixes: b6e6edcfa405 ("mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage")
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
> mm/memcontrol.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8d21c1a44220..16f3bdbd37d8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4213,6 +4213,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> memcg_memory_event(memcg, MEMCG_OOM);
> if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
Wouldn't it be more robust if we put an upper bound on the else case of
the above condition, i.e. a fixed number of retries? As you have
discovered, there is a hidden dependency between the forward progress of
oom_reaper and this check/code-path, which I think is not needed.
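
For illustration, a bounded variant of that else branch might look
roughly like this (a sketch only; reusing MAX_RECLAIM_RETRIES as the
bound and failing with -EBUSY are assumptions, not something specified
in this thread):

    int oom_retries = MAX_RECLAIM_RETRIES;  /* hypothetical bound */
    ...
    memcg_memory_event(memcg, MEMCG_OOM);
    if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
        break;
    if (!--oom_retries)
        return -EBUSY;  /* give up instead of spinning on oom_reaper */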
> break;
> + cond_resched();
> }
>
> memcg_wb_domain_size_changed(memcg);
> --
> 2.34.1
>
* Re: [PATCH] memcg: avoid dead loop when setting memory.max
From: Michal Hocko @ 2025-02-11 20:35 UTC
To: Shakeel Butt
Cc: Chen Ridong, hannes, roman.gushchin, muchun.song, akpm, cgroups,
linux-mm, linux-kernel, chenridong, wangweiyang2
On Tue 11-02-25 11:04:21, Shakeel Butt wrote:
> On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
[...]
> > Wouldn't it be more robust if we put an upper bound on the else case
> > of the above condition, i.e. a fixed number of retries? As you have
> > discovered, there is a hidden dependency between the forward progress
> > of oom_reaper and this check/code-path, which I think is not needed.
Any OOM path has a dependency on the oom_reaper or the task exiting. Is
there any reason why this path should be special? With cond_resched we
can look forward to the day when it is simply removed and the code will
still work. With a fixed number of retries we would get
non-deterministic, time-dependent behavior, because the number of
retries rather than forward progress would define the failure mode.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] memcg: avoid dead loop when setting memory.max
From: Shakeel Butt @ 2025-02-12 0:29 UTC
To: Michal Hocko
Cc: Chen Ridong, hannes, roman.gushchin, muchun.song, akpm, cgroups,
linux-mm, linux-kernel, chenridong, wangweiyang2
On Tue, Feb 11, 2025 at 09:35:47PM +0100, Michal Hocko wrote:
> On Tue 11-02-25 11:04:21, Shakeel Butt wrote:
> > On Tue, Feb 11, 2025 at 08:18:19AM +0000, Chen Ridong wrote:
> [...]
> > Wouldn't it be more robust if we put an upper bound on the else case
> > of the above condition, i.e. a fixed number of retries? As you have
> > discovered, there is a hidden dependency between the forward progress
> > of oom_reaper and this check/code-path, which I think is not needed.
>
> Any OOM path has a dependency on the oom_reaper or the task exiting.
Personally I would say any allocation (or charge) has a dependency on
oom_reaper making progress (but I am not arguing that point).
> Is there
> any reason why this path should be special? With cond_resched we can
> look forward to the day when it is simply removed and the code will
> still work. With a fixed number of retries we would get
> non-deterministic, time-dependent behavior, because the number of
> retries rather than forward progress would define the failure mode.
I am not against adding cond_resched(), which might/will be removed in
the future. To me it is just that we are leaving our fate to the CPU
scheduler here, which is maybe OK since we (MM) do it all over the
place. Anyway, no objection from me.