* [PATCH] mm/page_alloc: add cond_resched in __drain_all_pages()
@ 2024-12-25 6:26 mengensun88
2024-12-25 23:03 ` David Rientjes
From: mengensun88 @ 2024-12-25 6:26 UTC
To: akpm, linux-mm; +Cc: alexjlzheng, MengEn Sun
From: MengEn Sun <mengensun@tencent.com>
Since version v5.19-rc7, draining remote per-CPU pools (PCP) no
longer relies on workqueues; instead, the current CPU is
responsible for draining the PCPs of all CPUs.
However, because __drain_all_pages() contains no scheduling
points, it can trigger soft lockups in some extreme cases.
We observed the following soft-lockup stack on a 64-core,
223GB machine during testing:
watchdog: BUG: soft lockup - CPU#29 stuck for 23s! [stress-ng-vm]
RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1c0
_raw_spin_lock
drain_pages_zone
drain_pages
drain_all_pages
__alloc_pages_slowpath
__alloc_pages_nodemask
alloc_pages_vma
do_huge_pmd_anonymous_page
handle_mm_fault
Fixes: <443c2accd1b66> ("mm/page_alloc: remotely drain per-cpu lists")
Reviewed-by: JinLiang Zheng <alexjlzheng@tencent.com>
Signed-off-by: MengEn Sun <mengensun@tencent.com>
---
mm/page_alloc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6c7bb3ea71b..d05b32ec1e40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2487,6 +2487,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
drain_pages_zone(cpu, zone);
else
drain_pages(cpu);
+ cond_resched();
}
mutex_unlock(&pcpu_drain_mutex);
--
2.43.5
* Re: [PATCH] mm/page_alloc: add cond_resched in __drain_all_pages()
2024-12-25 6:26 [PATCH] mm/page_alloc: add cond_resched in __drain_all_pages() mengensun88
@ 2024-12-25 23:03 ` David Rientjes
2025-01-07 17:39 ` MengEn Sun
From: David Rientjes @ 2024-12-25 23:03 UTC
To: mengensun88; +Cc: akpm, linux-mm, alexjlzheng, MengEn Sun
On Wed, 25 Dec 2024, mengensun88@gmail.com wrote:
> From: MengEn Sun <mengensun@tencent.com>
>
> Since version v5.19-rc7, draining remote per-CPU pools (PCP) no
> longer relies on workqueues; instead, the current CPU is
> responsible for draining the PCPs of all CPUs.
>
> However, because __drain_all_pages() contains no scheduling
> points, it can trigger soft lockups in some extreme cases.
>
> We observed the following soft-lockup stack on a 64-core,
> 223GB machine during testing:
> watchdog: BUG: soft lockup - CPU#29 stuck for 23s! [stress-ng-vm]
> RIP: 0010:native_queued_spin_lock_slowpath+0x5b/0x1c0
> _raw_spin_lock
> drain_pages_zone
> drain_pages
> drain_all_pages
> __alloc_pages_slowpath
> __alloc_pages_nodemask
> alloc_pages_vma
> do_huge_pmd_anonymous_page
> handle_mm_fault
>
> Fixes: <443c2accd1b66> ("mm/page_alloc: remotely drain per-cpu lists")
The < > would be removed.
> Reviewed-by: JinLiang Zheng <alexjlzheng@tencent.com>
> Signed-off-by: MengEn Sun <mengensun@tencent.com>
> ---
> mm/page_alloc.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c6c7bb3ea71b..d05b32ec1e40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2487,6 +2487,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
> drain_pages_zone(cpu, zone);
> else
> drain_pages(cpu);
> + cond_resched();
> }
>
> mutex_unlock(&pcpu_drain_mutex);
This is another example of a soft lockup that we haven't observed, and we
have systems with many more cores than 64.
Is this happening because of contention on pcp->lock or zone->lock? I
would assume the latter, but best to confirm.
I think this is just papering over a scalability problem with zone->lock.
How many NUMA nodes and zones does this 223GB system have?
If this is a problem with zone->lock, this problem should likely be
addressed more holistically.
* Re: [PATCH] mm/page_alloc: add cond_resched in __drain_all_pages()
2024-12-25 23:03 ` David Rientjes
@ 2025-01-07 17:39 ` MengEn Sun
From: MengEn Sun @ 2025-01-07 17:39 UTC
To: rientjes; +Cc: akpm, alexjlzheng, linux-mm, mengensun88, mengensun
Hi David,
>
> > else
> > drain_pages(cpu);
> > + cond_resched();
> > }
> >
> > mutex_unlock(&pcpu_drain_mutex);
>
> This is another example of a soft lockup that we haven't observed and we
> have systems with many more cores than 64.
The cause of this issue does not seem to be related to the number of CPUs,
but rather to the ratio of memory capacity to CPU count, or to the total
memory capacity itself.
For example, my machine has 64 CPUs and 256 GB of memory in a single NUMA
node. Under the current kernel, the amount of a zone's memory that can sit
in the per-CPU pools (PCPs) across all CPUs is approximately one-eighth of
that zone's total.
So, in the worst case on my machine: the PCPs for the NORMAL zone can hold
about 32 GB (one-eighth of the total), and with 64 CPUs each CPU's pool can
grow to roughly 512 MB. With a 4 KB page size, that is about 131,072 (128K)
pages per CPU.
Although the PCP auto-tuning algorithm shrinks PCP capacity when memory is
tight (for example, when free memory falls below the high watermark or
during reclaim in the zone), that shrinking is only triggered by
allocation/free activity on the CPU or by delayed work, and neither trigger
is very predictable.
>
> Is this happening because of contention on pcp->lock or zone->lock? I
> would assume the latter, but best to confirm.
You are right: we were running memory stress tests, and zone->lock is
indeed the hotspot.
> I think this is just papering over a scalability problem with zone->lock.
> How many NUMA nodes and zones does this 223GB system have?
>
> If this is a problem with zone->lock, this problem should likely be
> addressed more holistically.
You are right; zone->lock can indeed become a hotspot on larger machines.
However, solving that fundamentally does not look easy, since the PCP
feature deliberately aggregates work for batch processing.
Another idea is to break the long critical sections into smaller ones, but
I am not sure whether that approach is feasible.
Best Regards