From: Vlastimil Babka <vbabka@suse.cz>
To: Andrew Morton <akpm@linux-foundation.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Brendan Jackman <jackmanb@google.com>,
Johannes Weiner <hannes@cmpxchg.org>, Zi Yan <ziy@nvidia.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Clark Williams <clrkwllms@kernel.org>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@techsingularity.net>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-rt-devel@lists.linux.dev, stable@vger.kernel.org,
kernel test robot <oliver.sang@intel.com>,
Matthew Wilcox <willy@infradead.org>,
Vlastimil Babka <vbabka@suse.cz>
Subject: [PATCH mm-hotfixes] mm/page_alloc: prevent pcp corruption with SMP=n
Date: Mon, 05 Jan 2026 16:08:56 +0100 [thread overview]
Message-ID: <20260105-fix-pcp-up-v1-1-5579662d2071@suse.cz> (raw)
The kernel test robot has reported:
BUG: spinlock trylock failure on UP on CPU#0, kcompactd0/28
lock: 0xffff888807e35ef0, .magic: dead4ead, .owner: kcompactd0/28, .owner_cpu: 0
CPU: 0 UID: 0 PID: 28 Comm: kcompactd0 Not tainted 6.18.0-rc5-00127-ga06157804399 #1 PREEMPT 8cc09ef94dcec767faa911515ce9e609c45db470
Call Trace:
<IRQ>
__dump_stack (lib/dump_stack.c:95)
dump_stack_lvl (lib/dump_stack.c:123)
dump_stack (lib/dump_stack.c:130)
spin_dump (kernel/locking/spinlock_debug.c:71)
do_raw_spin_trylock (kernel/locking/spinlock_debug.c:?)
_raw_spin_trylock (include/linux/spinlock_api_smp.h:89 kernel/locking/spinlock.c:138)
__free_frozen_pages (mm/page_alloc.c:2973)
___free_pages (mm/page_alloc.c:5295)
__free_pages (mm/page_alloc.c:5334)
tlb_remove_table_rcu (include/linux/mm.h:? include/linux/mm.h:3122 include/asm-generic/tlb.h:220 mm/mmu_gather.c:227 mm/mmu_gather.c:290)
? __cfi_tlb_remove_table_rcu (mm/mmu_gather.c:289)
? rcu_core (kernel/rcu/tree.c:?)
rcu_core (include/linux/rcupdate.h:341 kernel/rcu/tree.c:2607 kernel/rcu/tree.c:2861)
rcu_core_si (kernel/rcu/tree.c:2879)
handle_softirqs (arch/x86/include/asm/jump_label.h:36 include/trace/events/irq.h:142 kernel/softirq.c:623)
__irq_exit_rcu (arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:725)
irq_exit_rcu (kernel/softirq.c:741)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052)
</IRQ>
<TASK>
RIP: 0010:_raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
free_pcppages_bulk (mm/page_alloc.c:1494)
drain_pages_zone (include/linux/spinlock.h:391 mm/page_alloc.c:2632)
__drain_all_pages (mm/page_alloc.c:2731)
drain_all_pages (mm/page_alloc.c:2747)
kcompactd (mm/compaction.c:3115)
kthread (kernel/kthread.c:465)
? __cfi_kcompactd (mm/compaction.c:3166)
? __cfi_kthread (kernel/kthread.c:412)
ret_from_fork (arch/x86/kernel/process.c:164)
? __cfi_kthread (kernel/kthread.c:412)
ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
</TASK>
Matthew has analyzed the report and identified that in drain_page_zone()
we are in a section protected by spin_lock(&pcp->lock) and then get an
interrupt that attempts spin_trylock() on the same lock. The code is
designed to work this way without disabling IRQs and occasionally fail
the trylock with a fallback. However, the SMP=n spinlock implementation
assumes spin_trylock() will always succeed, and thus it's normally a
no-op. Here the enabled lock debugging catches the problem, but
otherwise it could cause a corruption of the pcp structure.
The problem has been introduced by commit 574907741599 ("mm/page_alloc:
leave IRQs enabled for per-cpu page allocations"). The pcp locking
scheme recognizes the need for disabling IRQs to prevent nesting
spin_trylock() sections on SMP=n, but the need to prevent the nesting in
spin_lock() has not been recognized. Fix it by introducing local
wrappers that change the spin_lock() to spin_lock_iqsave() with SMP=n
and use them in all places that do spin_lock(&pcp->lock).
Fixes: 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations")
Cc: stable@vger.kernel.org
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202512101320.e2f2dd6f-lkp@intel.com
Analyzed-by: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/all/aUW05pyc9nZkvY-1@casper.infradead.org/
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
This fix is intentionally made self-contained and not trying to expand
upon the existing pcp[u]_spin() helpers. This is to make stable
backports easier due to recent cleanups to that helpers.
We could follow up with a proper helpers integration going forward.
However I think the assumptions SMP=n of the spinlock UP implementation
are just wrong. It should be valid to do a spin_lock() without disabling
irq's and rely on a nested spin_trylock() to fail. I will thus try
proposing the remove the UP implementation first. It should be within
the current trend of removing stuff that's optimized for a minority
configuration if it makes maintainability of the majority worse.
(c.f. recent scheduler SMP=n removal)
---
mm/page_alloc.c | 45 +++++++++++++++++++++++++++++++++++++--------
1 file changed, 37 insertions(+), 8 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a964..ec3551d56cde 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -167,6 +167,31 @@ static inline void __pcp_trylock_noop(unsigned long *flags) { }
pcp_trylock_finish(UP_flags); \
})
+/*
+ * With the UP spinlock implementation, when we spin_lock(&pcp->lock) (for i.e.
+ * a potentially remote cpu drain) and get interrupted by an operation that
+ * attempts pcp_spin_trylock(), we can't rely on the trylock failure due to UP
+ * spinlock assumptions making the trylock a no-op. So we have to turn that
+ * spin_lock() to a spin_lock_irqsave(). This works because on UP there are no
+ * remote cpu's so we can only be locking the only existing local one.
+ */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+static inline void __flags_noop(unsigned long *flags) { }
+#define spin_lock_maybe_irqsave(lock, flags) \
+({ \
+ __flags_noop(&(flags)); \
+ spin_lock(lock); \
+})
+#define spin_unlock_maybe_irqrestore(lock, flags) \
+({ \
+ spin_unlock(lock); \
+ __flags_noop(&(flags)); \
+})
+#else
+#define spin_lock_maybe_irqsave(lock, flags) spin_lock_irqsave(lock, flags)
+#define spin_unlock_maybe_irqrestore(lock, flags) spin_unlock_irqrestore(lock, flags)
+#endif
+
#ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
DEFINE_PER_CPU(int, numa_node);
EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -2556,6 +2581,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, to_drain_batched, batch;
+ unsigned long UP_flags;
bool todo = false;
high_min = READ_ONCE(pcp->high_min);
@@ -2575,9 +2601,9 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
to_drain = pcp->count - pcp->high;
while (to_drain > 0) {
to_drain_batched = min(to_drain, batch);
- spin_lock(&pcp->lock);
+ spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
- spin_unlock(&pcp->lock);
+ spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
todo = true;
to_drain -= to_drain_batched;
@@ -2594,14 +2620,15 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
+ unsigned long UP_flags;
int to_drain, batch;
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0) {
- spin_lock(&pcp->lock);
+ spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
free_pcppages_bulk(zone, to_drain, pcp, 0);
- spin_unlock(&pcp->lock);
+ spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
}
}
#endif
@@ -2612,10 +2639,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ unsigned long UP_flags;
int count;
do {
- spin_lock(&pcp->lock);
+ spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
count = pcp->count;
if (count) {
int to_drain = min(count,
@@ -2624,7 +2652,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
free_pcppages_bulk(zone, to_drain, pcp, 0);
count -= to_drain;
}
- spin_unlock(&pcp->lock);
+ spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
} while (count);
}
@@ -6109,6 +6137,7 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
{
struct per_cpu_pages *pcp;
struct cpu_cacheinfo *cci;
+ unsigned long UP_flags;
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
cci = get_cpu_cacheinfo(cpu);
@@ -6119,12 +6148,12 @@ static void zone_pcp_update_cacheinfo(struct zone *zone, unsigned int cpu)
* This can reduce zone lock contention without hurting
* cache-hot pages sharing.
*/
- spin_lock(&pcp->lock);
+ spin_lock_maybe_irqsave(&pcp->lock, UP_flags);
if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
pcp->flags |= PCPF_FREE_HIGH_BATCH;
else
pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
- spin_unlock(&pcp->lock);
+ spin_unlock_maybe_irqrestore(&pcp->lock, UP_flags);
}
void setup_pcp_cacheinfo(unsigned int cpu)
---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20260105-fix-pcp-up-3c88c09752ec
Best regards,
--
Vlastimil Babka <vbabka@suse.cz>
next reply other threads:[~2026-01-05 15:09 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-05 15:08 Vlastimil Babka [this message]
2026-01-05 21:40 ` Steven Rostedt
2026-01-06 8:28 ` Vlastimil Babka
2026-01-06 15:21 ` Steven Rostedt
2026-01-07 21:19 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260105-fix-pcp-up-v1-1-5579662d2071@suse.cz \
--to=vbabka@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=bigeasy@linutronix.de \
--cc=clrkwllms@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=jackmanb@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-rt-devel@lists.linux.dev \
--cc=mgorman@techsingularity.net \
--cc=mhocko@suse.com \
--cc=oliver.sang@intel.com \
--cc=rostedt@goodmis.org \
--cc=stable@vger.kernel.org \
--cc=surenb@google.com \
--cc=willy@infradead.org \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox