* [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with lockless page table walkers
2026-02-02 7:45 [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers Lance Yang
@ 2026-02-02 7:45 ` Lance Yang
2026-02-02 9:42 ` Peter Zijlstra
2026-02-02 7:45 ` [PATCH v4 2/3] mm: switch callers to tlb_remove_table_sync_mm() Lance Yang
` (2 subsequent siblings)
3 siblings, 1 reply; 35+ messages in thread
From: Lance Yang @ 2026-02-02 7:45 UTC (permalink / raw)
To: akpm
Cc: david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang
From: Lance Yang <lance.yang@linux.dev>
Currently, tlb_remove_table_sync_one() broadcasts IPIs to all CPUs to wait
for any concurrent lockless page table walkers (e.g., GUP-fast). This is
inefficient on systems with many CPUs, especially for RT workloads[1].
This patch introduces a per-CPU tracking mechanism to record which CPUs are
actively performing lockless page table walks for a specific mm_struct.
When freeing/unsharing page tables, we can now send IPIs only to the CPUs
that are actually walking that mm, instead of broadcasting to all CPUs.
In preparation for targeted IPIs, a follow-up will switch callers to
tlb_remove_table_sync_mm().
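Walkers bracket the lockless walk with the new helpers while IRQs are
disabled; the pattern (as used by the gup.c and perf hunks below) is:

        local_irq_save(flags);
        pt_walk_lockless_start(mm);
        /* lockless page table walk, e.g. gup_fast_pgd_range() */
        pt_walk_lockless_end();
        local_irq_restore(flags);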
Note that the tracking adds ~3% latency to GUP-fast, as measured on a
64-core system.
[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
include/asm-generic/tlb.h | 2 ++
include/linux/mm.h | 34 ++++++++++++++++++++++++++
kernel/events/core.c | 2 ++
mm/gup.c | 2 ++
mm/mmu_gather.c | 50 +++++++++++++++++++++++++++++++++++++++
5 files changed, 90 insertions(+)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 4aeac0c3d3f0..b6b06e6b879f 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -250,6 +250,7 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
#endif
void tlb_remove_table_sync_one(void);
+void tlb_remove_table_sync_mm(struct mm_struct *mm);
#else
@@ -258,6 +259,7 @@ void tlb_remove_table_sync_one(void);
#endif
static inline void tlb_remove_table_sync_one(void) { }
+static inline void tlb_remove_table_sync_mm(struct mm_struct *mm) { }
#endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f8a8fd47399c..d92df995fcd1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2995,6 +2995,40 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
pgoff_t *offset);
int folio_add_pins(struct folio *folio, unsigned int pins);
+/*
+ * Track CPUs doing lockless page table walks to avoid broadcast IPIs
+ * during TLB flushes.
+ */
+DECLARE_PER_CPU(struct mm_struct *, active_lockless_pt_walk_mm);
+
+static inline void pt_walk_lockless_start(struct mm_struct *mm)
+{
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * Tell other CPUs we're doing lockless page table walk.
+ *
+ * Full barrier needed to prevent page table reads from being
+ * reordered before this write.
+ *
+ * Pairs with smp_rmb() in tlb_remove_table_sync_mm().
+ */
+ this_cpu_write(active_lockless_pt_walk_mm, mm);
+ smp_mb();
+}
+
+static inline void pt_walk_lockless_end(void)
+{
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * Clear the pointer so other CPUs no longer see this CPU as walking
+ * the mm. Use smp_store_release to ensure page table reads complete
+ * before the clear is visible to other CPUs.
+ */
+ smp_store_release(this_cpu_ptr(&active_lockless_pt_walk_mm), NULL);
+}
+
int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
int pin_user_pages_fast(unsigned long start, int nr_pages,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5b5cb620499e..6539112c28ff 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8190,7 +8190,9 @@ static u64 perf_get_page_size(unsigned long addr)
mm = &init_mm;
}
+ pt_walk_lockless_start(mm);
size = perf_get_pgtable_size(mm, addr);
+ pt_walk_lockless_end();
local_irq_restore(flags);
diff --git a/mm/gup.c b/mm/gup.c
index 8e7dc2c6ee73..6748e28b27f2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3154,7 +3154,9 @@ static unsigned long gup_fast(unsigned long start, unsigned long end,
* that come from callers of tlb_remove_table_sync_one().
*/
local_irq_save(flags);
+ pt_walk_lockless_start(current->mm);
gup_fast_pgd_range(start, end, gup_flags, pages, &nr_pinned);
+ pt_walk_lockless_end();
local_irq_restore(flags);
/*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 2faa23d7f8d4..35c89e4b6230 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -285,6 +285,56 @@ void tlb_remove_table_sync_one(void)
smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}
+DEFINE_PER_CPU(struct mm_struct *, active_lockless_pt_walk_mm);
+EXPORT_PER_CPU_SYMBOL_GPL(active_lockless_pt_walk_mm);
+
+/**
+ * tlb_remove_table_sync_mm - send IPIs to CPUs doing lockless page table
+ * walk for @mm
+ *
+ * @mm: target mm; only CPUs walking this mm get an IPI.
+ *
+ * Like tlb_remove_table_sync_one() but only targets CPUs in
+ * active_lockless_pt_walk_mm.
+ */
+void tlb_remove_table_sync_mm(struct mm_struct *mm)
+{
+ cpumask_var_t target_cpus;
+ bool found_any = false;
+ int cpu;
+
+ if (WARN_ONCE(!mm, "NULL mm in %s\n", __func__)) {
+ tlb_remove_table_sync_one();
+ return;
+ }
+
+ /* If we can't, fall back to broadcast. */
+ if (!alloc_cpumask_var(&target_cpus, GFP_ATOMIC)) {
+ tlb_remove_table_sync_one();
+ return;
+ }
+
+ cpumask_clear(target_cpus);
+
+ /* Pairs with smp_mb() in pt_walk_lockless_start(). */
+ smp_rmb();
+
+ /* Find CPUs doing lockless page table walks for this mm */
+ for_each_online_cpu(cpu) {
+ if (per_cpu(active_lockless_pt_walk_mm, cpu) == mm) {
+ cpumask_set_cpu(cpu, target_cpus);
+ found_any = true;
+ }
+ }
+
+ /* Only send IPIs to CPUs actually doing lockless walks */
+ if (found_any)
+ smp_call_function_many(target_cpus, tlb_remove_table_smp_sync,
+ NULL, 1);
+
+ free_cpumask_var(target_cpus);
+}
+
static void tlb_remove_table_rcu(struct rcu_head *head)
{
__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
--
2.49.0
* Re: [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with lockless page table walkers
2026-02-02 7:45 ` [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with " Lance Yang
@ 2026-02-02 9:42 ` Peter Zijlstra
2026-02-02 12:14 ` Lance Yang
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2026-02-02 9:42 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On Mon, Feb 02, 2026 at 03:45:55PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Currently, tlb_remove_table_sync_one() broadcasts IPIs to all CPUs to wait
> for any concurrent lockless page table walkers (e.g., GUP-fast). This is
> inefficient on systems with many CPUs, especially for RT workloads[1].
>
> This patch introduces a per-CPU tracking mechanism to record which CPUs are
> actively performing lockless page table walks for a specific mm_struct.
> When freeing/unsharing page tables, we can now send IPIs only to the CPUs
> that are actually walking that mm, instead of broadcasting to all CPUs.
>
> In preparation for targeted IPIs, a follow-up will switch callers to
> tlb_remove_table_sync_mm().
>
> Note that the tracking adds ~3% latency to GUP-fast, as measured on a
> 64-core system.
What architecture, and that is acceptable?
> +/*
> + * Track CPUs doing lockless page table walks to avoid broadcast IPIs
> + * during TLB flushes.
> + */
> +DECLARE_PER_CPU(struct mm_struct *, active_lockless_pt_walk_mm);
> +
> +static inline void pt_walk_lockless_start(struct mm_struct *mm)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * Tell other CPUs we're doing lockless page table walk.
> + *
> + * Full barrier needed to prevent page table reads from being
> + * reordered before this write.
> + *
> + * Pairs with smp_rmb() in tlb_remove_table_sync_mm().
> + */
> + this_cpu_write(active_lockless_pt_walk_mm, mm);
> + smp_mb();
One thing to try is something like:
xchg(this_cpu_ptr(&active_lockless_pt_walk_mm), mm);
That *might* be a little better on x86_64, on anything else you really
don't want to use this_cpu_() ops when you *know* IRQs are already
disabled.
> +}
> +
> +static inline void pt_walk_lockless_end(void)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + /*
> + * Clear the pointer so other CPUs no longer see this CPU as walking
> + * the mm. Use smp_store_release to ensure page table reads complete
> + * before the clear is visible to other CPUs.
> + */
> + smp_store_release(this_cpu_ptr(&active_lockless_pt_walk_mm), NULL);
> +}
> +
> int get_user_pages_fast(unsigned long start, int nr_pages,
> unsigned int gup_flags, struct page **pages);
> int pin_user_pages_fast(unsigned long start, int nr_pages,
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 2faa23d7f8d4..35c89e4b6230 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -285,6 +285,56 @@ void tlb_remove_table_sync_one(void)
> smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
> }
>
> +DEFINE_PER_CPU(struct mm_struct *, active_lockless_pt_walk_mm);
> +EXPORT_PER_CPU_SYMBOL_GPL(active_lockless_pt_walk_mm);
Why the heck is this exported? Both users are firmly core code.
> +/**
> + * tlb_remove_table_sync_mm - send IPIs to CPUs doing lockless page table
> + * walk for @mm
> + *
> + * @mm: target mm; only CPUs walking this mm get an IPI.
> + *
> + * Like tlb_remove_table_sync_one() but only targets CPUs in
> + * active_lockless_pt_walk_mm.
> + */
> +void tlb_remove_table_sync_mm(struct mm_struct *mm)
> +{
> + cpumask_var_t target_cpus;
> + bool found_any = false;
> + int cpu;
> +
> + if (WARN_ONCE(!mm, "NULL mm in %s\n", __func__)) {
> + tlb_remove_table_sync_one();
> + return;
> + }
> +
> + /* If we can't, fall back to broadcast. */
> + if (!alloc_cpumask_var(&target_cpus, GFP_ATOMIC)) {
> + tlb_remove_table_sync_one();
> + return;
> + }
> +
> + cpumask_clear(target_cpus);
> +
> + /* Pairs with smp_mb() in pt_walk_lockless_start(). */
Pairs how? The start thing does something like:
[W] active_lockless_pt_walk_mm = mm
MB
[L] page-tables
So this is:
[L] page-tables
RMB
[L] active_lockless_pt_walk_mm
?
> + smp_rmb();
> +
> + /* Find CPUs doing lockless page table walks for this mm */
> + for_each_online_cpu(cpu) {
> + if (per_cpu(active_lockless_pt_walk_mm, cpu) == mm) {
> + cpumask_set_cpu(cpu, target_cpus);
You really don't need this to be atomic.
> + found_any = true;
> + }
> + }
> +
> + /* Only send IPIs to CPUs actually doing lockless walks */
> + if (found_any)
> + smp_call_function_many(target_cpus, tlb_remove_table_smp_sync,
> + NULL, 1);
Coding style wants { } here. Also, isn't this what we have
smp_call_function_many_cond() for?
> + free_cpumask_var(target_cpus);
> +}
> +
> static void tlb_remove_table_rcu(struct rcu_head *head)
> {
> __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
> --
> 2.49.0
>
* Re: [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with lockless page table walkers
2026-02-02 9:42 ` Peter Zijlstra
@ 2026-02-02 12:14 ` Lance Yang
2026-02-02 12:51 ` Peter Zijlstra
2026-02-02 16:20 ` Dave Hansen
0 siblings, 2 replies; 35+ messages in thread
From: Lance Yang @ 2026-02-02 12:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
Hi Peter,
Thanks for taking time to review!
On 2026/2/2 17:42, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 03:45:55PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Currently, tlb_remove_table_sync_one() broadcasts IPIs to all CPUs to wait
>> for any concurrent lockless page table walkers (e.g., GUP-fast). This is
>> inefficient on systems with many CPUs, especially for RT workloads[1].
>>
>> This patch introduces a per-CPU tracking mechanism to record which CPUs are
>> actively performing lockless page table walks for a specific mm_struct.
>> When freeing/unsharing page tables, we can now send IPIs only to the CPUs
>> that are actually walking that mm, instead of broadcasting to all CPUs.
>>
>> In preparation for targeted IPIs, a follow-up will switch callers to
>> tlb_remove_table_sync_mm().
>>
>> Note that the tracking adds ~3% latency to GUP-fast, as measured on a
>> 64-core system.
>
> What architecture, and that is acceptable?
x86-64.
I ran ./gup_bench which spawns 60 threads, each doing 500k GUP-fast
operations (pinning 8 pages per call) via the gup_test ioctl.
Results for pin pages:
- Before: avg 1.489s (10 runs)
- After: avg 1.533s (10 runs)
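That works out to (1.533 - 1.489) / 1.489 ≈ 3%, which is where the ~3%
figure in the changelog comes from.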
Given we avoid broadcast IPIs on large systems, I think this is a
reasonable trade-off :)
>
>> +/*
>> + * Track CPUs doing lockless page table walks to avoid broadcast IPIs
>> + * during TLB flushes.
>> + */
>> +DECLARE_PER_CPU(struct mm_struct *, active_lockless_pt_walk_mm);
>> +
>> +static inline void pt_walk_lockless_start(struct mm_struct *mm)
>> +{
>> + lockdep_assert_irqs_disabled();
>> +
>> + /*
>> + * Tell other CPUs we're doing lockless page table walk.
>> + *
>> + * Full barrier needed to prevent page table reads from being
>> + * reordered before this write.
>> + *
>> + * Pairs with smp_rmb() in tlb_remove_table_sync_mm().
>> + */
>> + this_cpu_write(active_lockless_pt_walk_mm, mm);
>> + smp_mb();
>
> One thing to try is something like:
>
> xchg(this_cpu_ptr(&active_lockless_pt_walk_mm), mm);
>
> That *might* be a little better on x86_64, on anything else you really
> don't want to use this_cpu_() ops when you *know* IRQs are already
> disabled.
Ah, good to know that. Thanks!
IIUC, xchg() provides the full barrier we need ;)
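So the start helper would become something like this (untested sketch):

static inline void pt_walk_lockless_start(struct mm_struct *mm)
{
        lockdep_assert_irqs_disabled();

        /*
         * xchg() implies a full memory barrier, so the page table reads
         * that follow cannot be reordered before this store is visible.
         */
        xchg(this_cpu_ptr(&active_lockless_pt_walk_mm), mm);
}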
>
>> +}
>> +
>> +static inline void pt_walk_lockless_end(void)
>> +{
>> + lockdep_assert_irqs_disabled();
>> +
>> + /*
>> + * Clear the pointer so other CPUs no longer see this CPU as walking
>> + * the mm. Use smp_store_release to ensure page table reads complete
>> + * before the clear is visible to other CPUs.
>> + */
>> + smp_store_release(this_cpu_ptr(&active_lockless_pt_walk_mm), NULL);
>> +}
>> +
>> int get_user_pages_fast(unsigned long start, int nr_pages,
>> unsigned int gup_flags, struct page **pages);
>> int pin_user_pages_fast(unsigned long start, int nr_pages,
>
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index 2faa23d7f8d4..35c89e4b6230 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -285,6 +285,56 @@ void tlb_remove_table_sync_one(void)
>> smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
>> }
>>
>> +DEFINE_PER_CPU(struct mm_struct *, active_lockless_pt_walk_mm);
>> +EXPORT_PER_CPU_SYMBOL_GPL(active_lockless_pt_walk_mm);
>
> Why the heck is this exported? Both users are firmly core code.
OK. Will drop this export.
>
>> +/**
>> + * tlb_remove_table_sync_mm - send IPIs to CPUs doing lockless page table
>> + * walk for @mm
>> + *
>> + * @mm: target mm; only CPUs walking this mm get an IPI.
>> + *
>> + * Like tlb_remove_table_sync_one() but only targets CPUs in
>> + * active_lockless_pt_walk_mm.
>> + */
>> +void tlb_remove_table_sync_mm(struct mm_struct *mm)
>> +{
>> + cpumask_var_t target_cpus;
>> + bool found_any = false;
>> + int cpu;
>> +
>> + if (WARN_ONCE(!mm, "NULL mm in %s\n", __func__)) {
>> + tlb_remove_table_sync_one();
>> + return;
>> + }
>> +
>> + /* If we can't, fall back to broadcast. */
>> + if (!alloc_cpumask_var(&target_cpus, GFP_ATOMIC)) {
>> + tlb_remove_table_sync_one();
>> + return;
>> + }
>> +
>> + cpumask_clear(target_cpus);
>> +
>> + /* Pairs with smp_mb() in pt_walk_lockless_start(). */
>
> Pairs how? The start thing does something like:
>
> [W] active_lockless_pt_walk_mm = mm
> MB
> [L] page-tables
>
> So this is:
>
> [L] page-tables
> RMB
> [L] active_lockless_pt_walk_mm
>
> ?
On the walker side (pt_walk_lockless_start):
[W] active_lockless_pt_walk_mm = mm
MB
[L] page-tables (walker reads page tables)
So the walker publishes "I'm walking this mm" before reading page tables.
On the sync side we don't read page-tables. We do:
RMB
[L] active_lockless_pt_walk_mm (we read the per-CPU pointer below)
We need to observe the walker's store of active_lockless_pt_walk_mm before
we decide which CPUs to IPI.
So on the sync side we do smp_rmb(), then read active_lockless_pt_walk_mm.
That pairs with the full barrier in pt_walk_lockless_start().
>
>> + smp_rmb();
>> +
>> + /* Find CPUs doing lockless page table walks for this mm */
>> + for_each_online_cpu(cpu) {
>> + if (per_cpu(active_lockless_pt_walk_mm, cpu) == mm) {
>> + cpumask_set_cpu(cpu, target_cpus);
>
> You really don't need this to be atomic.
>
>> + found_any = true;
>> + }
>> + }
>> +
>> + /* Only send IPIs to CPUs actually doing lockless walks */
>> + if (found_any)
>> + smp_call_function_many(target_cpus, tlb_remove_table_smp_sync,
>> + NULL, 1);
>
> Coding style wants { } here. Also, isn't this what we have
> smp_call_function_many_cond() for?
Right! That would be better, something like:

static bool tlb_remove_table_sync_mm_cond(int cpu, void *mm)
{
        return per_cpu(active_lockless_pt_walk_mm, cpu) == (struct mm_struct *)mm;
}

on_each_cpu_cond_mask(tlb_remove_table_sync_mm_cond,
                      tlb_remove_table_smp_sync,
                      (void *)mm, true, cpu_online_mask);
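And then tlb_remove_table_sync_mm() itself could shrink to something
like this (untested, and leaving the barrier question above aside):

void tlb_remove_table_sync_mm(struct mm_struct *mm)
{
        if (WARN_ONCE(!mm, "NULL mm in %s\n", __func__)) {
                tlb_remove_table_sync_one();
                return;
        }

        on_each_cpu_cond_mask(tlb_remove_table_sync_mm_cond,
                              tlb_remove_table_smp_sync,
                              mm, true, cpu_online_mask);
}

That also drops the cpumask allocation, so the GFP_ATOMIC fallback to
the broadcast goes away.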
>
>> + free_cpumask_var(target_cpus);
>> +}
>> +
>> static void tlb_remove_table_rcu(struct rcu_head *head)
>> {
>> __tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
>> --
>> 2.49.0
>>
Thanks,
Lance
* Re: [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with lockless page table walkers
2026-02-02 12:14 ` Lance Yang
@ 2026-02-02 12:51 ` Peter Zijlstra
2026-02-02 13:23 ` Lance Yang
2026-02-02 16:20 ` Dave Hansen
1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2026-02-02 12:51 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On Mon, Feb 02, 2026 at 08:14:32PM +0800, Lance Yang wrote:
> > > + /* Pairs with smp_mb() in pt_walk_lockless_start(). */
> >
> > Pairs how? The start thing does something like:
> >
> > [W] active_lockless_pt_walk_mm = mm
> > MB
> > [L] page-tables
> >
> > So this is:
> >
> > [L] page-tables
> > RMB
> > [L] active_lockless_pt_walk_mm
> >
> > ?
>
> On the walker side (pt_walk_lockless_start):
>
> [W] active_lockless_pt_walk_mm = mm
> MB
> [L] page-tables (walker reads page tables)
>
> So the walker publishes "I'm walking this mm" before reading page tables.
>
> On the sync side we don't read page-tables. We do:
>
> RMB
> [L] active_lockless_pt_walk_mm (we read the per-CPU pointer below)
>
> We need to observe the walker's store of active_lockless_pt_walk_mm before
> we decide which CPUs to IPI.
>
> So on the sync side we do smp_rmb(), then read active_lockless_pt_walk_mm.
>
> That pairs with the full barrier in pt_walk_lockless_start().
No it doesn't; this is not how memory barriers work.
* Re: [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with lockless page table walkers
2026-02-02 12:51 ` Peter Zijlstra
@ 2026-02-02 13:23 ` Lance Yang
2026-02-02 13:42 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Lance Yang @ 2026-02-02 13:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On 2026/2/2 20:51, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 08:14:32PM +0800, Lance Yang wrote:
>
>>>> + /* Pairs with smp_mb() in pt_walk_lockless_start(). */
>>>
>>> Pairs how? The start thing does something like:
>>>
>>> [W] active_lockless_pt_walk_mm = mm
>>> MB
>>> [L] page-tables
>>>
>>> So this is:
>>>
>>> [L] page-tables
>>> RMB
>>> [L] active_lockless_pt_walk_mm
>>>
>>> ?
>>
>> On the walker side (pt_walk_lockless_start):
>>
>> [W] active_lockless_pt_walk_mm = mm
>> MB
>> [L] page-tables (walker reads page tables)
>>
>> So the walker publishes "I'm walking this mm" before reading page tables.
>>
>> On the sync side we don't read page-tables. We do:
>>
>> RMB
>> [L] active_lockless_pt_walk_mm (we read the per-CPU pointer below)
>>
>> We need to observe the walker's store of active_lockless_pt_walk_mm before
>> we decide which CPUs to IPI.
>>
>> So on the sync side we do smp_rmb(), then read active_lockless_pt_walk_mm.
>>
>> That pairs with the full barrier in pt_walk_lockless_start().
>
> No it doesn't; this is not how memory barriers work.
Hmm... we need MB rather than RMB on the sync side. Is that correct?
Walker:
[W]active_lockless_pt_walk_mm = mm -> MB -> [L]page-tables
Sync:
[W]page-tables -> MB -> [L]active_lockless_pt_walk_mm
Thanks,
Lance
* Re: [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with lockless page table walkers
2026-02-02 13:23 ` Lance Yang
@ 2026-02-02 13:42 ` Peter Zijlstra
2026-02-02 14:28 ` Lance Yang
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2026-02-02 13:42 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On Mon, Feb 02, 2026 at 09:23:07PM +0800, Lance Yang wrote:
> Hmm... we need MB rather than RMB on the sync side. Is that correct?
>
> Walker:
> [W]active_lockless_pt_walk_mm = mm -> MB -> [L]page-tables
>
> Sync:
> [W]page-tables -> MB -> [L]active_lockless_pt_walk_mm
>
This can work -- but only if the walker and sync touch the same
page-table address.
Now, typically I would imagine they both share the p4d/pud address at
the very least, right?
* Re: [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with lockless page table walkers
2026-02-02 13:42 ` Peter Zijlstra
@ 2026-02-02 14:28 ` Lance Yang
0 siblings, 0 replies; 35+ messages in thread
From: Lance Yang @ 2026-02-02 14:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On 2026/2/2 21:42, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 09:23:07PM +0800, Lance Yang wrote:
>
>> Hmm... we need MB rather than RMB on the sync side. Is that correct?
>>
>> Walker:
>> [W]active_lockless_pt_walk_mm = mm -> MB -> [L]page-tables
>>
>> Sync:
>> [W]page-tables -> MB -> [L]active_lockless_pt_walk_mm
>>
>
> This can work -- but only if the walker and sync touch the same
> page-table address.
>
> Now, typically I would imagine they both share the p4d/pud address at
> the very least, right?
Thanks. I think I see the confusion ...
To be clear, the goal is not to make the walker see page-table writes
through the MB pairing, but to wait for any concurrent lockless page
table walkers to finish.
The flow is:
1) Page tables are modified
2) TLB flush is done
3) Read active_lockless_pt_walk_mm (with MB to order page-table writes
   before this read) to find which CPUs are locklessly walking this mm
4) IPI those CPUs
5) The IPI forces them to sync, so after the IPI returns, any in-flight
   lockless page table walk has finished (or will restart and see the
   new page tables)
The synchronization relies on the IPI to ensure walkers stop before
continuing.
I would assume the TLB flush (step 2) should imply some barrier.
Does that clarify?
* Re: [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with lockless page table walkers
2026-02-02 12:14 ` Lance Yang
2026-02-02 12:51 ` Peter Zijlstra
@ 2026-02-02 16:20 ` Dave Hansen
1 sibling, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2026-02-02 16:20 UTC (permalink / raw)
To: Lance Yang, Peter Zijlstra
Cc: akpm, david, dave.hansen, ypodemsk, hughd, will, aneesh.kumar,
npiggin, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, shy828301, riel, jannh, jgross, seanjc, pbonzini,
boris.ostrovsky, virtualization, kvm, linux-arch, linux-mm,
linux-kernel, ioworker0
On 2/2/26 04:14, Lance Yang wrote:
>>> Note that the tracking adds ~3% latency to GUP-fast, as measured on a
>>> 64-core system.
>>
>> What architecture, and that is acceptable?
>
> x86-64.
>
> I ran ./gup_bench which spawns 60 threads, each doing 500k GUP-fast
> operations (pinning 8 pages per call) via the gup_test ioctl.
>
> Results for pin pages:
> - Before: avg 1.489s (10 runs)
> - After: avg 1.533s (10 runs)
>
> Given we avoid broadcast IPIs on large systems, I think this is a
> reasonable trade-off 🙂
I thought the big databases were really sensitive to GUP-fast latency.
They like big systems, too. Won't they howl when this finally hits their
testing?
Also, two of the "write" side here are:
* collapse_huge_page() (khugepaged)
* tlb_remove_table() (in an "-ENOMEM" path)
Those are quite slow paths, right? Shouldn't the design here favor
keeping gup-fast as fast as possible as opposed to impacting those?
* [PATCH v4 2/3] mm: switch callers to tlb_remove_table_sync_mm()
* [PATCH v4 2/3] mm: switch callers to tlb_remove_table_sync_mm()
2026-02-02 7:45 [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers Lance Yang
2026-02-02 7:45 ` [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with " Lance Yang
@ 2026-02-02 7:45 ` Lance Yang
2026-02-02 7:45 ` [PATCH v4 3/3] x86/tlb: add architecture-specific TLB IPI optimization support Lance Yang
2026-02-02 9:54 ` [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers Peter Zijlstra
3 siblings, 0 replies; 35+ messages in thread
From: Lance Yang @ 2026-02-02 7:45 UTC (permalink / raw)
To: akpm
Cc: david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang
From: Lance Yang <lance.yang@linux.dev>
Now that we have tlb_remove_table_sync_mm(), convert callers from
tlb_remove_table_sync_one() to enable targeted IPIs instead of broadcast.
Three callers updated:
1) collapse_huge_page() - after flushing the old PMD, IPI only the CPUs
walking this mm instead of all CPUs.
2) tlb_flush_unshared_tables() - when unsharing hugetlb page tables,
use tlb->mm for targeted IPIs.
3) __tlb_remove_table_one() - updated to take an mmu_gather parameter so
it can use tlb->mm when batch allocation fails.
Note that pmdp_get_lockless_sync() (PAE only) also calls
tlb_remove_table_sync_one() under PTL to ensure all ongoing PMD split-reads
complete between pmdp_get_lockless_{start,end}; the critical section is
very short. I'm inclined not to convert it since PAE systems typically
don't have many cores.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
include/asm-generic/tlb.h | 11 ++++++-----
mm/khugepaged.c | 2 +-
mm/mmu_gather.c | 12 ++++++------
3 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b6b06e6b879f..40eb74b28f9d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -831,17 +831,18 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
/*
* Similarly, we must make sure that concurrent GUP-fast will not
* walk previously-shared page tables that are getting modified+reused
- * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
+ * elsewhere. So send an IPI to wait for any concurrent GUP-fast.
*
- * We only perform this when we are the last sharer of a page table,
- * as the IPI will reach all CPUs: any GUP-fast.
+ * We only perform this when we are the last sharer of a page table.
+ * Use targeted IPI to CPUs actively walking this mm instead of
+ * broadcast.
*
- * Note that on configs where tlb_remove_table_sync_one() is a NOP,
+ * Note that on configs where tlb_remove_table_sync_mm() is a NOP,
* the expectation is that the tlb_flush_mmu_tlbonly() would have issued
* required IPIs already for us.
*/
if (tlb->fully_unshared_tables) {
- tlb_remove_table_sync_one();
+ tlb_remove_table_sync_mm(tlb->mm);
tlb->fully_unshared_tables = false;
}
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa1e57fd2c46..7781d6628649 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1173,7 +1173,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
_pmd = pmdp_collapse_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(&range);
- tlb_remove_table_sync_one();
+ tlb_remove_table_sync_mm(mm);
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 35c89e4b6230..76573ec454e5 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -378,7 +378,7 @@ static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
__tlb_remove_table(ptdesc);
}
-static inline void __tlb_remove_table_one(void *table)
+static inline void __tlb_remove_table_one(struct mmu_gather *tlb, void *table)
{
struct ptdesc *ptdesc;
@@ -386,16 +386,16 @@ static inline void __tlb_remove_table_one(void *table)
call_rcu(&ptdesc->pt_rcu_head, __tlb_remove_table_one_rcu);
}
#else
-static inline void __tlb_remove_table_one(void *table)
+static inline void __tlb_remove_table_one(struct mmu_gather *tlb, void *table)
{
- tlb_remove_table_sync_one();
+ tlb_remove_table_sync_mm(tlb->mm);
__tlb_remove_table(table);
}
#endif /* CONFIG_PT_RECLAIM */
-static void tlb_remove_table_one(void *table)
+static void tlb_remove_table_one(struct mmu_gather *tlb, void *table)
{
- __tlb_remove_table_one(table);
+ __tlb_remove_table_one(tlb, table);
}
static void tlb_table_flush(struct mmu_gather *tlb)
@@ -417,7 +417,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT);
if (*batch == NULL) {
tlb_table_invalidate(tlb);
- tlb_remove_table_one(table);
+ tlb_remove_table_one(tlb, table);
return;
}
(*batch)->nr = 0;
--
2.49.0
* [PATCH v4 3/3] x86/tlb: add architecture-specific TLB IPI optimization support
2026-02-02 7:45 [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers Lance Yang
2026-02-02 7:45 ` [PATCH v4 1/3] mm: use targeted IPIs for TLB sync with " Lance Yang
2026-02-02 7:45 ` [PATCH v4 2/3] mm: switch callers to tlb_remove_table_sync_mm() Lance Yang
@ 2026-02-02 7:45 ` Lance Yang
2026-02-02 9:54 ` [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers Peter Zijlstra
3 siblings, 0 replies; 35+ messages in thread
From: Lance Yang @ 2026-02-02 7:45 UTC (permalink / raw)
To: akpm
Cc: david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, peterz, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang
From: Lance Yang <lance.yang@linux.dev>
When the TLB flush path already sends IPIs (e.g. native without INVLPGB,
or KVM), tlb_remove_table_sync_mm() does not need to send another round.
Add a property on pv_mmu_ops so each paravirt backend can indicate whether
its flush_tlb_multi sends real IPIs; if so, tlb_remove_table_sync_mm() is
a no-op.
Native sets it in native_pv_tlb_init() when still using
native_flush_tlb_multi() and INVLPGB is disabled. KVM sets it true; Xen and
Hyper-V set it false because they use hypercalls.
Also pass both freed_tables and unshared_tables from tlb_flush() into
flush_tlb_mm_range() so lazy-TLB CPUs get IPIs during hugetlb unshare.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
arch/x86/hyperv/mmu.c | 5 +++++
arch/x86/include/asm/paravirt.h | 5 +++++
arch/x86/include/asm/paravirt_types.h | 6 ++++++
arch/x86/include/asm/tlb.h | 20 +++++++++++++++++++-
arch/x86/kernel/kvm.c | 6 ++++++
arch/x86/kernel/paravirt.c | 18 ++++++++++++++++++
arch/x86/kernel/smpboot.c | 1 +
arch/x86/xen/mmu_pv.c | 2 ++
include/asm-generic/tlb.h | 15 +++++++++++++++
mm/mmu_gather.c | 7 +++++++
10 files changed, 84 insertions(+), 1 deletion(-)
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index cfcb60468b01..fc8fb275f295 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -243,4 +243,9 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
+ /*
+ * Hyper-V uses hypercalls for TLB flush, not real IPIs.
+ * Keep the property as false.
+ */
+ pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast = false;
}
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 13f9cd31c8f8..1fdbe3736f41 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -698,6 +698,7 @@ static __always_inline unsigned long arch_local_irq_save(void)
extern void default_banner(void);
void native_pv_lock_init(void) __init;
+void native_pv_tlb_init(void) __init;
#else /* __ASSEMBLER__ */
@@ -727,6 +728,10 @@ void native_pv_lock_init(void) __init;
static inline void native_pv_lock_init(void)
{
}
+
+static inline void native_pv_tlb_init(void)
+{
+}
#endif
#endif /* !CONFIG_PARAVIRT */
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3502939415ad..d8aa519ef5e3 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -133,6 +133,12 @@ struct pv_mmu_ops {
void (*flush_tlb_multi)(const struct cpumask *cpus,
const struct flush_tlb_info *info);
+ /*
+ * Indicates whether flush_tlb_multi IPIs provide sufficient
+ * synchronization during TLB flush when freeing or unsharing page tables.
+ */
+ bool flush_tlb_multi_implies_ipi_broadcast;
+
/* Hook for intercepting the destruction of an mm_struct. */
void (*exit_mmap)(struct mm_struct *mm);
void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..1e524d8e260a 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,23 @@
#define tlb_flush tlb_flush
static inline void tlb_flush(struct mmu_gather *tlb);
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
#include <asm-generic/tlb.h>
#include <linux/kernel.h>
#include <vdso/bits.h>
#include <vdso/page.h>
+#include <asm/paravirt.h>
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+ return pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast;
+#else
+ return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+#endif
+}
static inline void tlb_flush(struct mmu_gather *tlb)
{
@@ -20,7 +33,12 @@ static inline void tlb_flush(struct mmu_gather *tlb)
end = tlb->end;
}
- flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+ /*
+ * During TLB flushes, pass both freed_tables and unshared_tables
+ * so lazy-TLB CPUs receive IPIs.
+ */
+ flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+ tlb->freed_tables || tlb->unshared_tables);
}
static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 37dc8465e0f5..6a5e47ee4eb6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -856,6 +856,12 @@ static void __init kvm_guest_init(void)
#ifdef CONFIG_SMP
if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
+ /*
+ * KVM's flush implementation calls native_flush_tlb_multi(),
+ * which sends real IPIs when INVLPGB is not available.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast = true;
pr_info("KVM setup pv remote TLB flush\n");
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc69..1af253c9f51d 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -60,6 +60,23 @@ void __init native_pv_lock_init(void)
static_branch_enable(&virt_spin_lock_key);
}
+void __init native_pv_tlb_init(void)
+{
+ /*
+ * Check if we're still using native TLB flush (not overridden by
+ * a PV backend) and don't have INVLPGB support.
+ *
+ * In this case, native IPI-based TLB flush provides sufficient
+ * synchronization for GUP-fast.
+ *
+ * PV backends (KVM, Xen, HyperV) should set this property in their
+ * own initialization code if their flush implementation sends IPIs.
+ */
+ if (pv_ops.mmu.flush_tlb_multi == native_flush_tlb_multi &&
+ !cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ pv_ops.mmu.flush_tlb_multi_implies_ipi_broadcast = true;
+}
+
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;
@@ -173,6 +190,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
+ .mmu.flush_tlb_multi_implies_ipi_broadcast = false,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..3cdb04162843 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1167,6 +1167,7 @@ void __init native_smp_prepare_boot_cpu(void)
switch_gdt_and_percpu_base(me);
native_pv_lock_init();
+ native_pv_tlb_init();
}
void __init native_smp_cpus_done(unsigned int max_cpus)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 7a35c3393df4..b6d86299cf10 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2185,6 +2185,8 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
.flush_tlb_kernel = xen_flush_tlb,
.flush_tlb_one_user = xen_flush_tlb_one_user,
.flush_tlb_multi = xen_flush_tlb_multi,
+ /* Xen uses hypercalls for TLB flush, not real IPIs */
+ .flush_tlb_multi_implies_ipi_broadcast = false,
.pgd_alloc = xen_pgd_alloc,
.pgd_free = xen_pgd_free,
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 40eb74b28f9d..fae97c8bcceb 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -240,6 +240,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
}
#endif /* CONFIG_MMU_GATHER_TABLE_FREE */
+/*
+ * Architectures can override this to indicate whether TLB flush operations
+ * send IPIs that are sufficient to synchronize with lockless page table
+ * walkers (e.g., GUP-fast). If true, tlb_remove_table_sync_mm() becomes
+ * a no-op as the TLB flush already provided the necessary IPI.
+ *
+ * Default is false, meaning we need explicit IPIs via tlb_remove_table_sync_mm().
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+ return false;
+}
+#endif
+
#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
/*
* This allows an architecture that does not use the linux page-tables for
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 76573ec454e5..9620480c11ce 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -303,6 +303,13 @@ void tlb_remove_table_sync_mm(struct mm_struct *mm)
bool found_any = false;
int cpu;
+ /*
+ * If the architecture's TLB flush already sent IPIs that are sufficient
+ * for synchronization, we don't need to send additional IPIs.
+ */
+ if (tlb_table_flush_implies_ipi_broadcast())
+ return;
+
if (WARN_ONCE(!mm, "NULL mm in %s\n", __func__)) {
tlb_remove_table_sync_one();
return;
--
2.49.0
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers
2026-02-02 7:45 [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers Lance Yang
` (2 preceding siblings ...)
2026-02-02 7:45 ` [PATCH v4 3/3] x86/tlb: add architecture-specific TLB IPI optimization support Lance Yang
@ 2026-02-02 9:54 ` Peter Zijlstra
2026-02-02 11:00 ` [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table Lance Yang
3 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2026-02-02 9:54 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On Mon, Feb 02, 2026 at 03:45:54PM +0800, Lance Yang wrote:
> When freeing or unsharing page tables we send an IPI to synchronize with
> concurrent lockless page table walkers (e.g. GUP-fast). Today we broadcast
> that IPI to all CPUs, which is costly on large machines and hurts RT
> workloads[1].
>
> This series makes those IPIs targeted. We track which CPUs are currently
> doing a lockless page table walk for a given mm (per-CPU
> active_lockless_pt_walk_mm). When we need to sync, we only IPI those CPUs.
> GUP-fast and perf_get_page_size() set/clear the tracker around their walk;
> tlb_remove_table_sync_mm() uses it and replaces the previous broadcast in
> the free/unshare paths.
I'm confused. This only happens when !PT_RECLAIM, because if PT_RECLAIM
__tlb_remove_table_one() actually uses RCU.
So why are you making things more expensive for no reason?
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 9:54 ` [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table walkers Peter Zijlstra
@ 2026-02-02 11:00 ` Lance Yang
2026-02-02 12:50 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Lance Yang @ 2026-02-02 11:00 UTC (permalink / raw)
To: peterz
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, david, dev.jain,
hpa, hughd, ioworker0, jannh, jgross, kvm, lance.yang,
linux-arch, linux-kernel, linux-mm, lorenzo.stoakes, mingo,
npache, npiggin, pbonzini, riel, ryan.roberts, seanjc, shy828301,
tglx, virtualization, will, x86, ypodemsk, ziy
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 1678 bytes --]
On Mon, 2 Feb 2026 10:54:14 +0100, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 03:45:54PM +0800, Lance Yang wrote:
> > When freeing or unsharing page tables we send an IPI to synchronize with
> > concurrent lockless page table walkers (e.g. GUP-fast). Today we broadcast
> > that IPI to all CPUs, which is costly on large machines and hurts RT
> > workloads[1].
> >
> > This series makes those IPIs targeted. We track which CPUs are currently
> > doing a lockless page table walk for a given mm (per-CPU
> > active_lockless_pt_walk_mm). When we need to sync, we only IPI those CPUs.
> > GUP-fast and perf_get_page_size() set/clear the tracker around their walk;
> > tlb_remove_table_sync_mm() uses it and replaces the previous broadcast in
> > the free/unshare paths.
>
> I'm confused. This only happens when !PT_RECLAIM, because if PT_RECLAIM
> __tlb_remove_table_one() actually uses RCU.
>
> So why are you making things more expensive for no reason?
You're right that when CONFIG_PT_RECLAIM is set, __tlb_remove_table_one()
uses call_rcu() and we never call any sync there — this series doesn't
touch that path.
In the !PT_RECLAIM table-free path (same __tlb_remove_table_one() branch
that calls tlb_remove_table_sync_mm(tlb->mm) before __tlb_remove_table),
we're not adding any new sync; we're replacing the existing broadcast IPI
(tlb_remove_table_sync_one()) with targeted IPIs (tlb_remove_table_sync_mm()).
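For reference, the two branches in mm/mmu_gather.c (as of patch 2) are:

#ifdef CONFIG_PT_RECLAIM
static inline void __tlb_remove_table_one(struct mmu_gather *tlb, void *table)
{
        struct ptdesc *ptdesc;

        ptdesc = table;
        call_rcu(&ptdesc->pt_rcu_head, __tlb_remove_table_one_rcu);
}
#else
static inline void __tlb_remove_table_one(struct mmu_gather *tlb, void *table)
{
        tlb_remove_table_sync_mm(tlb->mm);
        __tlb_remove_table(table);
}
#endif /* CONFIG_PT_RECLAIM */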
One thing I just realized: when CONFIG_MMU_GATHER_RCU_TABLE_FREE is not
set, the sync path isn't used at all (tlb_remove_table_sync_one() and
friends aren't even compiled), so we don't need the tracker in that config.
Thanks for raising this!
Lance
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 11:00 ` [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table Lance Yang
@ 2026-02-02 12:50 ` Peter Zijlstra
2026-02-02 12:58 ` Lance Yang
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2026-02-02 12:50 UTC (permalink / raw)
To: Lance Yang
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, david, dev.jain,
hpa, hughd, ioworker0, jannh, jgross, kvm, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
pbonzini, riel, ryan.roberts, seanjc, shy828301, tglx,
virtualization, will, x86, ypodemsk, ziy
On Mon, Feb 02, 2026 at 07:00:16PM +0800, Lance Yang wrote:
>
> On Mon, 2 Feb 2026 10:54:14 +0100, Peter Zijlstra wrote:
> > On Mon, Feb 02, 2026 at 03:45:54PM +0800, Lance Yang wrote:
> > > When freeing or unsharing page tables we send an IPI to synchronize with
> > > concurrent lockless page table walkers (e.g. GUP-fast). Today we broadcast
> > > that IPI to all CPUs, which is costly on large machines and hurts RT
> > > workloads[1].
> > >
> > > This series makes those IPIs targeted. We track which CPUs are currently
> > > doing a lockless page table walk for a given mm (per-CPU
> > > active_lockless_pt_walk_mm). When we need to sync, we only IPI those CPUs.
> > > GUP-fast and perf_get_page_size() set/clear the tracker around their walk;
> > > tlb_remove_table_sync_mm() uses it and replaces the previous broadcast in
> > > the free/unshare paths.
> >
> > I'm confused. This only happens when !PT_RECLAIM, because if PT_RECLAIM
> > __tlb_remove_table_one() actually uses RCU.
> >
> > So why are you making things more expensive for no reason?
>
> You're right that when CONFIG_PT_RECLAIM is set, __tlb_remove_table_one()
> uses call_rcu() and we never call any sync there — this series doesn't
> touch that path.
>
> In the !PT_RECLAIM table-free path (same __tlb_remove_table_one() branch
> that calls tlb_remove_table_sync_mm(tlb->mm) before __tlb_remove_table),
> we're not adding any new sync; we're replacing the existing broadcast IPI
> (tlb_remove_table_sync_one()) with targeted IPIs (tlb_remove_table_sync_mm()).
Right, but if we can use full RCU for PT_RECLAIM, why can't we do so
unconditionally and not add overhead?
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 12:50 ` Peter Zijlstra
@ 2026-02-02 12:58 ` Lance Yang
2026-02-02 13:07 ` Lance Yang
0 siblings, 1 reply; 35+ messages in thread
From: Lance Yang @ 2026-02-02 12:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, david, dev.jain,
hpa, hughd, ioworker0, jannh, jgross, kvm, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
pbonzini, riel, ryan.roberts, seanjc, shy828301, tglx,
virtualization, will, x86, ypodemsk, ziy
On 2026/2/2 20:50, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 07:00:16PM +0800, Lance Yang wrote:
>>
>> On Mon, 2 Feb 2026 10:54:14 +0100, Peter Zijlstra wrote:
>>> On Mon, Feb 02, 2026 at 03:45:54PM +0800, Lance Yang wrote:
>>>> When freeing or unsharing page tables we send an IPI to synchronize with
>>>> concurrent lockless page table walkers (e.g. GUP-fast). Today we broadcast
>>>> that IPI to all CPUs, which is costly on large machines and hurts RT
>>>> workloads[1].
>>>>
>>>> This series makes those IPIs targeted. We track which CPUs are currently
>>>> doing a lockless page table walk for a given mm (per-CPU
>>>> active_lockless_pt_walk_mm). When we need to sync, we only IPI those CPUs.
>>>> GUP-fast and perf_get_page_size() set/clear the tracker around their walk;
>>>> tlb_remove_table_sync_mm() uses it and replaces the previous broadcast in
>>>> the free/unshare paths.
>>>
>>> I'm confused. This only happens when !PT_RECLAIM, because if PT_RECLAIM
>>> __tlb_remove_table_one() actually uses RCU.
>>>
>>> So why are you making things more expensive for no reason?
>>
>> You're right that when CONFIG_PT_RECLAIM is set, __tlb_remove_table_one()
>> uses call_rcu() and we never call any sync there — this series doesn't
>> touch that path.
>>
>> In the !PT_RECLAIM table-free path (same __tlb_remove_table_one() branch
>> that calls tlb_remove_table_sync_mm(tlb->mm) before __tlb_remove_table),
>> we're not adding any new sync; we're replacing the existing broadcast IPI
>> (tlb_remove_table_sync_one()) with targeted IPIs (tlb_remove_table_sync_mm()).
>
> Right, but if we can use full RCU for PT_RECLAIM, why can't we do so
> unconditionally and not add overhead?
The sync (IPI) is mainly needed for unshare (e.g. hugetlb) and collapse
(khugepaged) paths, regardless of whether table free uses RCU, IIUC.
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 12:58 ` Lance Yang
@ 2026-02-02 13:07 ` Lance Yang
2026-02-02 13:37 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Lance Yang @ 2026-02-02 13:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, david, dev.jain,
hpa, hughd, ioworker0, jannh, jgross, kvm, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
pbonzini, riel, ryan.roberts, seanjc, shy828301, tglx,
virtualization, will, x86, ypodemsk, ziy
On 2026/2/2 20:58, Lance Yang wrote:
>
>
> On 2026/2/2 20:50, Peter Zijlstra wrote:
>> On Mon, Feb 02, 2026 at 07:00:16PM +0800, Lance Yang wrote:
>>>
>>> On Mon, 2 Feb 2026 10:54:14 +0100, Peter Zijlstra wrote:
>>>> On Mon, Feb 02, 2026 at 03:45:54PM +0800, Lance Yang wrote:
>>>>> When freeing or unsharing page tables we send an IPI to synchronize
>>>>> with
>>>>> concurrent lockless page table walkers (e.g. GUP-fast). Today we
>>>>> broadcast
>>>>> that IPI to all CPUs, which is costly on large machines and hurts RT
>>>>> workloads[1].
>>>>>
>>>>> This series makes those IPIs targeted. We track which CPUs are
>>>>> currently
>>>>> doing a lockless page table walk for a given mm (per-CPU
>>>>> active_lockless_pt_walk_mm). When we need to sync, we only IPI
>>>>> those CPUs.
>>>>> GUP-fast and perf_get_page_size() set/clear the tracker around
>>>>> their walk;
>>>>> tlb_remove_table_sync_mm() uses it and replaces the previous
>>>>> broadcast in
>>>>> the free/unshare paths.
>>>>
>>>> I'm confused. This only happens when !PT_RECLAIM, because if PT_RECLAIM
>>>> __tlb_remove_table_one() actually uses RCU.
>>>>
>>>> So why are you making things more expensive for no reason?
>>>
>>> You're right that when CONFIG_PT_RECLAIM is set,
>>> __tlb_remove_table_one()
>>> uses call_rcu() and we never call any sync there — this series doesn't
>>> touch that path.
>>>
>>> In the !PT_RECLAIM table-free path (same __tlb_remove_table_one() branch
>>> that calls tlb_remove_table_sync_mm(tlb->mm) before __tlb_remove_table),
>>> we're not adding any new sync; we're replacing the existing broadcast
>>> IPI
>>> (tlb_remove_table_sync_one()) with targeted IPIs
>>> (tlb_remove_table_sync_mm()).
>>
>> Right, but if we can use full RCU for PT_RECLAIM, why can't we do so
>> unconditionally and not add overhead?
>
> The sync (IPI) is mainly needed for unshare (e.g. hugetlb) and collapse
> (khugepaged) paths, regardless of whether table free uses RCU, IIUC.
In addition: We need the sync when we modify page tables (e.g. unshare,
collapse), not only when we free them. RCU can defer freeing but does
not prevent lockless walkers from seeing concurrent in-place
modifications, so we need the IPI to synchronize with those walkers
first.
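For example, the collapse path in khugepaged (after patch 2) does
roughly:

        _pmd = pmdp_collapse_flush(vma, address, pmd);
        spin_unlock(pmd_ptl);
        mmu_notifier_invalidate_range_end(&range);
        tlb_remove_table_sync_mm(mm);   /* wait out concurrent lockless walkers */
        pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);

The pte table isn't freed at that point; the sync is about the in-place
PMD change, so deferring the free via RCU wouldn't cover it.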
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 13:07 ` Lance Yang
@ 2026-02-02 13:37 ` Peter Zijlstra
2026-02-02 14:37 ` Lance Yang
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2026-02-02 13:37 UTC (permalink / raw)
To: Lance Yang
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, david, dev.jain,
hpa, hughd, ioworker0, jannh, jgross, kvm, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
pbonzini, riel, ryan.roberts, seanjc, shy828301, tglx,
virtualization, will, x86, ypodemsk, ziy
On Mon, Feb 02, 2026 at 09:07:10PM +0800, Lance Yang wrote:
> > > Right, but if we can use full RCU for PT_RECLAIM, why can't we do so
> > > unconditionally and not add overhead?
> >
> > The sync (IPI) is mainly needed for unshare (e.g. hugetlb) and collapse
> > (khugepaged) paths, regardless of whether table free uses RCU, IIUC.
>
> In addition: We need the sync when we modify page tables (e.g. unshare,
> collapse), not only when we free them. RCU can defer freeing but does
> not prevent lockless walkers from seeing concurrent in-place
> modifications, so we need the IPI to synchronize with those walkers
> first.
Currently PT_RECLAIM=y has no IPI; are you saying that is broken? If
not, then why do we need this at all?
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 13:37 ` Peter Zijlstra
@ 2026-02-02 14:37 ` Lance Yang
2026-02-02 15:09 ` Peter Zijlstra
0 siblings, 1 reply; 35+ messages in thread
From: Lance Yang @ 2026-02-02 14:37 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, david, dev.jain,
hpa, hughd, ioworker0, jannh, jgross, kvm, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
pbonzini, riel, ryan.roberts, seanjc, shy828301, tglx,
virtualization, will, x86, ypodemsk, ziy
On 2026/2/2 21:37, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 09:07:10PM +0800, Lance Yang wrote:
>
>>>> Right, but if we can use full RCU for PT_RECLAIM, why can't we do so
>>>> unconditionally and not add overhead?
>>>
>>> The sync (IPI) is mainly needed for unshare (e.g. hugetlb) and collapse
>>> (khugepaged) paths, regardless of whether table free uses RCU, IIUC.
>>
>> In addition: We need the sync when we modify page tables (e.g. unshare,
>> collapse), not only when we free them. RCU can defer freeing but does
>> not prevent lockless walkers from seeing concurrent in-place
>> modifications, so we need the IPI to synchronize with those walkers
>> first.
>
> Currently PT_RECLAIM=y has no IPI; are you saying that is broken? If
> not, then why do we need this at all?
PT_RECLAIM=y does have IPI for unshare/collapse — those paths call
tlb_flush_unshared_tables() (for hugetlb unshare) and collapse_huge_page()
(in khugepaged collapse), which already send IPIs today (broadcast to all
CPUs via tlb_remove_table_sync_one()).
What PT_RECLAIM=y doesn't need IPI for is table freeing (
__tlb_remove_table_one() uses call_rcu() instead). But table modification
(unshare, collapse) still needs IPI to synchronize with lockless walkers,
regardless of PT_RECLAIM.
So PT_RECLAIM=y is not broken; it already has IPI where needed. This series
just makes those IPIs targeted instead of broadcast. Does that clarify?
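For reference, the broadcast sync is roughly this (simplified from
mm/mmu_gather.c, quoting from memory, so details may differ slightly):
static void tlb_remove_table_smp_sync(void *arg)
{
	/* The IPI handler does nothing; delivery is the point. */
}
void tlb_remove_table_sync_one(void)
{
	/*
	 * Wait for every other CPU to take the IPI. A CPU cannot take it
	 * while it has IRQs disabled, i.e. while it is in the middle of a
	 * GUP-fast style walk, so once this returns those walkers are done.
	 */
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}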
Thanks,
Lance
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 14:37 ` Lance Yang
@ 2026-02-02 15:09 ` Peter Zijlstra
2026-02-02 15:52 ` Lance Yang
0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2026-02-02 15:09 UTC (permalink / raw)
To: Lance Yang
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, david, dev.jain,
hpa, hughd, ioworker0, jannh, jgross, kvm, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
pbonzini, riel, ryan.roberts, seanjc, shy828301, tglx,
virtualization, will, x86, ypodemsk, ziy
On Mon, Feb 02, 2026 at 10:37:39PM +0800, Lance Yang wrote:
>
>
> On 2026/2/2 21:37, Peter Zijlstra wrote:
> > On Mon, Feb 02, 2026 at 09:07:10PM +0800, Lance Yang wrote:
> >
> > > > > Right, but if we can use full RCU for PT_RECLAIM, why can't we do so
> > > > > unconditionally and not add overhead?
> > > >
> > > > The sync (IPI) is mainly needed for unshare (e.g. hugetlb) and collapse
> > > > (khugepaged) paths, regardless of whether table free uses RCU, IIUC.
> > >
> > > In addition: We need the sync when we modify page tables (e.g. unshare,
> > > collapse), not only when we free them. RCU can defer freeing but does
> > > not prevent lockless walkers from seeing concurrent in-place
> > > modifications, so we need the IPI to synchronize with those walkers
> > > first.
> >
> > Currently PT_RECLAIM=y has no IPI; are you saying that is broken? If
> > not, then why do we need this at all?
>
> PT_RECLAIM=y does have IPI for unshare/collapse — those paths call
> tlb_flush_unshared_tables() (for hugetlb unshare) and collapse_huge_page()
> (in khugepaged collapse), which already send IPIs today (broadcast to all
> CPUs via tlb_remove_table_sync_one()).
>
> What PT_RECLAIM=y doesn't need IPI for is table freeing (
> __tlb_remove_table_one() uses call_rcu() instead). But table modification
> (unshare, collapse) still needs IPI to synchronize with lockless walkers,
> regardless of PT_RECLAIM.
>
> So PT_RECLAIM=y is not broken; it already has IPI where needed. This series
> just makes those IPIs targeted instead of broadcast. Does that clarify?
Oh bah, reading is hard. I had missed they had more table_sync_one() calls,
rather than remove_table_one().
So you *can* replace table_sync_one() with rcu_sync(), that will provide
the same guarantees. It's just a 'little' bit slower on the update side,
but does not incur the read side cost.
I really think anything here needs to better explain the various
requirements. Because now everybody gets to pay the price for hugetlb
shared crud, while 'nobody' will actually use that.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 15:09 ` Peter Zijlstra
@ 2026-02-02 15:52 ` Lance Yang
2026-02-05 13:25 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 35+ messages in thread
From: Lance Yang @ 2026-02-02 15:52 UTC (permalink / raw)
To: Peter Zijlstra, david
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, dev.jain, hpa,
hughd, ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2026/2/2 23:09, Peter Zijlstra wrote:
> On Mon, Feb 02, 2026 at 10:37:39PM +0800, Lance Yang wrote:
>>
>>
>> On 2026/2/2 21:37, Peter Zijlstra wrote:
>>> On Mon, Feb 02, 2026 at 09:07:10PM +0800, Lance Yang wrote:
>>>
>>>>>> Right, but if we can use full RCU for PT_RECLAIM, why can't we do so
>>>>>> unconditionally and not add overhead?
>>>>>
>>>>> The sync (IPI) is mainly needed for unshare (e.g. hugetlb) and collapse
>>>>> (khugepaged) paths, regardless of whether table free uses RCU, IIUC.
>>>>
>>>> In addition: We need the sync when we modify page tables (e.g. unshare,
>>>> collapse), not only when we free them. RCU can defer freeing but does
>>>> not prevent lockless walkers from seeing concurrent in-place
>>>> modifications, so we need the IPI to synchronize with those walkers
>>>> first.
>>>
>>> Currently PT_RECLAIM=y has no IPI; are you saying that is broken? If
>>> not, then why do we need this at all?
>>
>> PT_RECLAIM=y does have IPI for unshare/collapse — those paths call
>> tlb_flush_unshared_tables() (for hugetlb unshare) and collapse_huge_page()
>> (in khugepaged collapse), which already send IPIs today (broadcast to all
>> CPUs via tlb_remove_table_sync_one()).
>>
>> What PT_RECLAIM=y doesn't need IPI for is table freeing (
>> __tlb_remove_table_one() uses call_rcu() instead). But table modification
>> (unshare, collapse) still needs IPI to synchronize with lockless walkers,
>> regardless of PT_RECLAIM.
>>
>> So PT_RECLAIM=y is not broken; it already has IPI where needed. This series
>> just makes those IPIs targeted instead of broadcast. Does that clarify?
>
> Oh bah, reading is hard. I had missed they had more table_sync_one() calls,
> rather than remove_table_one().
>
> So you *can* replace table_sync_one() with rcu_sync(), that will provide
> the same guarantees. It's just a 'little' bit slower on the update side,
> but does not incur the read side cost.
Yep, we could replace the IPI with synchronize_rcu() on the sync side:
- Currently: TLB flush → send IPI → wait for walkers to finish
- With synchronize_rcu(): TLB flush → synchronize_rcu() → wait for the
grace period
Lockless walkers (e.g. GUP-fast) use local_irq_disable();
synchronize_rcu() also waits for regions with preemption/interrupts
disabled, so it should work, IIUC.
And then, the trade-off would be:
- Read side: zero cost (no per-CPU tracking)
- Write side: wait for RCU grace period (potentially slower)
For collapse/unshare, that write-side latency might be acceptable :)
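A minimal sketch of what the write side would look like with that change
(my own illustration, not code from this series; the function name is
made up):
/* Hypothetical replacement for the broadcast-IPI sync. */
static void tlb_walk_sync_rcu(void)
{
	/*
	 * GUP-fast walkers run with IRQs disabled; synchronize_rcu() does
	 * not return until every region that had IRQs/preemption disabled
	 * when it was called has finished, so any walker that could still
	 * see the old entries is gone once this returns.
	 */
	synchronize_rcu();
}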
@David, what do you think?
>
> I really think anything here needs to better explain the various
> requirements. Because now everybody gets to pay the price for hugetlb
> shared crud, while 'nobody' will actually use that.
Right. If we go with synchronize_rcu(), the read-side cost goes away ...
Thanks,
Lance
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-02 15:52 ` Lance Yang
@ 2026-02-05 13:25 ` David Hildenbrand (Arm)
2026-02-05 15:01 ` Lance Yang
0 siblings, 1 reply; 35+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 13:25 UTC (permalink / raw)
To: Lance Yang, Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dave.hansen, dev.jain, hpa,
hughd, ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2/2/26 16:52, Lance Yang wrote:
>
>
> On 2026/2/2 23:09, Peter Zijlstra wrote:
>> On Mon, Feb 02, 2026 at 10:37:39PM +0800, Lance Yang wrote:
>>>
>>>
>>>
>>> PT_RECLAIM=y does have IPI for unshare/collapse — those paths call
>>> tlb_flush_unshared_tables() (for hugetlb unshare) and
>>> collapse_huge_page()
>>> (in khugepaged collapse), which already send IPIs today (broadcast to
>>> all
>>> CPUs via tlb_remove_table_sync_one()).
>>>
>>> What PT_RECLAIM=y doesn't need IPI for is table freeing (
>>> __tlb_remove_table_one() uses call_rcu() instead). But table
>>> modification
>>> (unshare, collapse) still needs IPI to synchronize with lockless
>>> walkers,
>>> regardless of PT_RECLAIM.
>>>
>>> So PT_RECLAIM=y is not broken; it already has IPI where needed. This
>>> series
>>> just makes those IPIs targeted instead of broadcast. Does that clarify?
>>
>> Oh bah, reading is hard. I had missed they had more table_sync_one()
>> calls,
>> rather than remove_table_one().
>>
>> So you *can* replace table_sync_one() with rcu_sync(), that will provide
>> the same guarantees. It's just a 'little' bit slower on the update side,
>> but does not incur the read side cost.
>
> Yep, we could replace the IPI with synchronize_rcu() on the sync side:
>
> - Currently: TLB flush → send IPI → wait for walkers to finish
> - With synchronize_rcu(): TLB flush → synchronize_rcu() -> waits for
> grace period
>
> Lockless walkers (e.g. GUP-fast) use local_irq_disable();
> synchronize_rcu() also
> waits for regions with preemption/interrupts disabled, so it should
> work, IIUC.
>
> And then, the trade-off would be:
> - Read side: zero cost (no per-CPU tracking)
> - Write side: wait for RCU grace period (potentially slower)
>
> For collapse/unshare, that write-side latency might be acceptable :)
>
> @David, what do you think?
Given that we just fixed the write-side latency from breaking Oracle's
databases completely, we have to be a bit careful here :)
The thing is: on many x86 configs we don't need *any* TLB flushes or RCU
syncs.
So "how much slower" are we talking about, especially on bigger/loaded
systems?
--
Cheers,
David
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 13:25 ` David Hildenbrand (Arm)
@ 2026-02-05 15:01 ` Lance Yang
2026-02-05 15:05 ` David Hildenbrand (Arm)
2026-02-05 15:09 ` Dave Hansen
0 siblings, 2 replies; 35+ messages in thread
From: Lance Yang @ 2026-02-05 15:01 UTC (permalink / raw)
To: David Hildenbrand (Arm), Peter Zijlstra, dave.hansen
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2026/2/5 21:25, David Hildenbrand (Arm) wrote:
> On 2/2/26 16:52, Lance Yang wrote:
>>
>>
>> On 2026/2/2 23:09, Peter Zijlstra wrote:
>>> On Mon, Feb 02, 2026 at 10:37:39PM +0800, Lance Yang wrote:
>>>>
>>>>
>>>>
>>>> PT_RECLAIM=y does have IPI for unshare/collapse — those paths call
>>>> tlb_flush_unshared_tables() (for hugetlb unshare) and
>>>> collapse_huge_page()
>>>> (in khugepaged collapse), which already send IPIs today (broadcast
>>>> to all
>>>> CPUs via tlb_remove_table_sync_one()).
>>>>
>>>> What PT_RECLAIM=y doesn't need IPI for is table freeing (
>>>> __tlb_remove_table_one() uses call_rcu() instead). But table
>>>> modification
>>>> (unshare, collapse) still needs IPI to synchronize with lockless
>>>> walkers,
>>>> regardless of PT_RECLAIM.
>>>>
>>>> So PT_RECLAIM=y is not broken; it already has IPI where needed. This
>>>> series
>>>> just makes those IPIs targeted instead of broadcast. Does that clarify?
>>>
>>> Oh bah, reading is hard. I had missed they had more table_sync_one()
>>> calls,
>>> rather than remove_table_one().
>>>
>>> So you *can* replace table_sync_one() with rcu_sync(), that will provide
>>> the same guarantees. It's just a 'little' bit slower on the update side,
>>> but does not incur the read side cost.
>>
>> Yep, we could replace the IPI with synchronize_rcu() on the sync side:
>>
>> - Currently: TLB flush → send IPI → wait for walkers to finish
>> - With synchronize_rcu(): TLB flush → synchronize_rcu() -> waits for
>> grace period
>>
>> Lockless walkers (e.g. GUP-fast) use local_irq_disable();
>> synchronize_rcu() also
>> waits for regions with preemption/interrupts disabled, so it should
>> work, IIUC.
>>
>> And then, the trade-off would be:
>> - Read side: zero cost (no per-CPU tracking)
>> - Write side: wait for RCU grace period (potentially slower)
>>
>> For collapse/unshare, that write-side latency might be acceptable :)
>>
>> @David, what do you think?
>
> Given that we just fixed the write-side latency from breaking Oracle's
> databases completely, we have to be a bit careful here :)
Yep, agreed.
>
> The thing is: on many x86 configs we don't need *any* TLB flushes or RCU
> syncs.
Right. Looks like that is low-hanging fruit. I'll send that out
separately :)
>
> So "how much slower" are we talking about, especially on bigger/loaded
> systems?
Unfortunately the numbers are pretty bad. On an x86-64 64-core system
under high load, each synchronize_rcu() is about *22.9* ms on average ...
So for now, neither approach looks good: tracking on the read side adds
cost to GUP-fast, and syncing on the write side e.g. synchronize_rcu()
is too slow on large systems.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 15:01 ` Lance Yang
@ 2026-02-05 15:05 ` David Hildenbrand (Arm)
2026-02-05 15:28 ` Lance Yang
2026-02-05 15:09 ` Dave Hansen
1 sibling, 1 reply; 35+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 15:05 UTC (permalink / raw)
To: Lance Yang, Peter Zijlstra, dave.hansen
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2/5/26 16:01, Lance Yang wrote:
>
>
> On 2026/2/5 21:25, David Hildenbrand (Arm) wrote:
>> On 2/2/26 16:52, Lance Yang wrote:
>>>
>>>
>>>
>>> Yep, we could replace the IPI with synchronize_rcu() on the sync side:
>>>
>>> - Currently: TLB flush → send IPI → wait for walkers to finish
>>> - With synchronize_rcu(): TLB flush → synchronize_rcu() -> waits for
>>> grace period
>>>
>>> Lockless walkers (e.g. GUP-fast) use local_irq_disable();
>>> synchronize_rcu() also
>>> waits for regions with preemption/interrupts disabled, so it should
>>> work, IIUC.
>>>
>>> And then, the trade-off would be:
>>> - Read side: zero cost (no per-CPU tracking)
>>> - Write side: wait for RCU grace period (potentially slower)
>>>
>>> For collapse/unshare, that write-side latency might be acceptable :)
>>>
>>> @David, what do you think?
>>
>> Given that we just fixed the write-side latency from breaking Oracle's
>> databases completely, we have to be a bit careful here :)
>
> Yep, agreed.
>
>>
>> The thing is: on many x86 configs we don't need *any* TLB flushes or
>> RCU syncs.
>
> Right. Looks like that is low-hanging fruit. I'll send that out
> separately :)
>
>>
>> So "how much slower" are we talking about, especially on bigger/loaded
>> systems?
>
> Unfortunately the numbers are pretty bad. On an x86-64 64-core system
> under high load, each synchronize_rcu() is about *22.9* ms on average ...
>
> So for now, neither approach looks good: tracking on the read side adds
> cost to GUP-fast, and syncing on the write side e.g. synchronize_rcu()
> is too slow on large systems.
GUP-fast is 3%, right? Any way we can reduce that to 1% and call it
noise? :)
--
Cheers,
David
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 15:05 ` David Hildenbrand (Arm)
@ 2026-02-05 15:28 ` Lance Yang
0 siblings, 0 replies; 35+ messages in thread
From: Lance Yang @ 2026-02-05 15:28 UTC (permalink / raw)
To: David Hildenbrand (Arm), Peter Zijlstra, dave.hansen
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2026/2/5 23:05, David Hildenbrand (Arm) wrote:
> On 2/5/26 16:01, Lance Yang wrote:
>>
>>
>> On 2026/2/5 21:25, David Hildenbrand (Arm) wrote:
>>> On 2/2/26 16:52, Lance Yang wrote:
>>>>
>>>>
>>>>
>>>> Yep, we could replace the IPI with synchronize_rcu() on the sync side:
>>>>
>>>> - Currently: TLB flush → send IPI → wait for walkers to finish
>>>> - With synchronize_rcu(): TLB flush → synchronize_rcu() -> waits for
>>>> grace period
>>>>
>>>> Lockless walkers (e.g. GUP-fast) use local_irq_disable();
>>>> synchronize_rcu() also
>>>> waits for regions with preemption/interrupts disabled, so it should
>>>> work, IIUC.
>>>>
>>>> And then, the trade-off would be:
>>>> - Read side: zero cost (no per-CPU tracking)
>>>> - Write side: wait for RCU grace period (potentially slower)
>>>>
>>>> For collapse/unshare, that write-side latency might be acceptable :)
>>>>
>>>> @David, what do you think?
>>>
>>> Given that we just fixed the write-side latency from breaking
>>> Oracle's databases completely, we have to be a bit careful here :)
>>
>> Yep, agreed.
>>
>>>
>>> The thing is: on many x86 configs we don't need *any* TLB flushes or
>>> RCU syncs.
>>
>> Right. Looks like that is low-hanging fruit. I'll send that out
>> separately :)
>>
>>>
>>> So "how much slower" are we talking about, especially on bigger/
>>> loaded systems?
>>
>> Unfortunately the numbers are pretty bad. On an x86-64 64-core system
>> under high load, each synchronize_rcu() is about *22.9* ms on average ...
>>
>> So for now, neither approach looks good: tracking on the read side adds
>> cost to GUP-fast, and syncing on the write side e.g. synchronize_rcu()
>> is too slow on large systems.
>
> GUP-fast is 3%, right? Any way we can reduce that to 1% and call it
> noise? :)
Yes, GUP-fast is ~3%. I'll keep trying to get it down, but first I'll get
the low-hanging fruit done :)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 15:01 ` Lance Yang
2026-02-05 15:05 ` David Hildenbrand (Arm)
@ 2026-02-05 15:09 ` Dave Hansen
2026-02-05 15:31 ` Lance Yang
1 sibling, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2026-02-05 15:09 UTC (permalink / raw)
To: Lance Yang, David Hildenbrand (Arm), Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2/5/26 07:01, Lance Yang wrote:
> So for now, neither approach looks good: tracking on the read side adds
> cost to GUP-fast, and syncing on the write side e.g. synchronize_rcu()
> is too slow on large systems.
Which of the writers truly *need* synchronize_rcu()?
What are they doing with the memory that they can't move forward unless
it's quiescent *now*?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 15:09 ` Dave Hansen
@ 2026-02-05 15:31 ` Lance Yang
2026-02-05 15:41 ` Dave Hansen
0 siblings, 1 reply; 35+ messages in thread
From: Lance Yang @ 2026-02-05 15:31 UTC (permalink / raw)
To: Dave Hansen, David Hildenbrand (Arm), Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2026/2/5 23:09, Dave Hansen wrote:
> On 2/5/26 07:01, Lance Yang wrote:
>> So for now, neither approach looks good: tracking on the read side adds
>> cost to GUP-fast, and syncing on the write side e.g. synchronize_rcu()
>> is too slow on large systems.
>
> Which of the writers truly *need* synchronize_rcu()?
>
> What are they doing with the memory that they can't move forward unless
> it's quiescent *now*?
Without IPIs or synchronize_rcu(), IIUC, we have no way to know if there
are ongoing concurrent lockless page-table walks — the walkers just disable
IRQs and walk.
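For context, the read side has roughly this shape (simplified; the real
walk in mm/gup.c is more involved):
	unsigned long flags;
	/*
	 * Disabling IRQs blocks the table-free/sync IPI, so the walk below
	 * cannot overlap with anyone waiting in tlb_remove_table_sync_one().
	 */
	local_irq_save(flags);
	/* walk pgd -> p4d -> pud -> pmd -> pte without mmap_lock or PTLs */
	local_irq_restore(flags);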
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 15:31 ` Lance Yang
@ 2026-02-05 15:41 ` Dave Hansen
2026-02-05 16:30 ` Lance Yang
0 siblings, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2026-02-05 15:41 UTC (permalink / raw)
To: Lance Yang, David Hildenbrand (Arm), Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2/5/26 07:31, Lance Yang wrote:
> On 2026/2/5 23:09, Dave Hansen wrote:
>> On 2/5/26 07:01, Lance Yang wrote:
>>> So for now, neither approach looks good: tracking on the read side adds
>>> cost to GUP-fast, and syncing on the write side e.g. synchronize_rcu()
>>> is too slow on large systems.
>>
>> Which of the writers truly *need* synchronize_rcu()?
>>
>> What are they doing with the memory that they can't move forward unless
>> it's quiescent *now*?
>
> Without IPIs or synchronize_rcu(), IIUC, we have no way to know if there
> are ongoing concurrent lockless page-table walks — the walkers just disable
> IRQs and walk.
Yeah, but one aim of RCU is ensuring that readers see valid data but not
necessarily the most up to date data.
Are there cases where ongoing concurrent lockless page-table walks need
to see the writes and they can't tolerate seeing valid but slightly
stale data?
Don't forget that we also have pesky concurrent lockless page-table
walkers called CPUs. They're extra pesky in that they don't even stop
for IPIs. ;)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 15:41 ` Dave Hansen
@ 2026-02-05 16:30 ` Lance Yang
2026-02-05 16:46 ` David Hildenbrand (Arm)
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Lance Yang @ 2026-02-05 16:30 UTC (permalink / raw)
To: Dave Hansen, David Hildenbrand (Arm), Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2026/2/5 23:41, Dave Hansen wrote:
> On 2/5/26 07:31, Lance Yang wrote:
>> On 2026/2/5 23:09, Dave Hansen wrote:
>>> On 2/5/26 07:01, Lance Yang wrote:
>>>> So for now, neither approach looks good: tracking on the read side adds
>>>> cost to GUP-fast, and syncing on the write side e.g. synchronize_rcu()
>>>> is too slow on large systems.
>>>
>>> Which of the writers truly *need* synchronize_rcu()?
>>>
>>> What are they doing with the memory that they can't move forward unless
>>> it's quiescent *now*?
>>
>> Without IPIs or synchronize_rcu(), IIUC, we have no way to know if there
>> are ongoing concurrent lockless page-table walks — the walkers just disable
>> IRQs and walk.
>
> Yeah, but one aim of RCU is ensuring that readers see valid data but not
> necessarily the most up to date data.
>
> Are there cases where ongoing concurrent lockless page-table walks need
> to see the writes and they can't tolerate seeing valid but slightly
> stale data?
The issue is we're about to free the page table (e.g.
pmdp_collapse_flush()).
We have to ensure no walker is still doing a lockless page-table walk
when the page directories are freed, otherwise we get use-after-free.
> Don't forget that we also have pesky concurrent lockless page-table
> walkers called CPUs. They're extra pesky in that they don't even stop
> for IPIs. ;)
I assume those walkers that don't disable IRQs only read the PMD and
don't walk into the table; otherwise the current sync wouldn't work
for them.
Thanks,
Lance
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 16:30 ` Lance Yang
@ 2026-02-05 16:46 ` David Hildenbrand (Arm)
2026-02-05 16:48 ` Matthew Wilcox
2026-02-05 17:00 ` Dave Hansen
2 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 16:46 UTC (permalink / raw)
To: Lance Yang, Dave Hansen, Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2/5/26 17:30, Lance Yang wrote:
>
>
> On 2026/2/5 23:41, Dave Hansen wrote:
>> On 2/5/26 07:31, Lance Yang wrote:
>>>
>>> Without IPIs or synchronize_rcu(), IIUC, we have no way to know if there
>>> are ongoing concurrent lockless page-table walks — the walkers just
>>> disable
>>> IRQs and walk.
>>
>> Yeah, but one aim of RCU is ensuring that readers see valid data but not
>> necessarily the most up to date data.
>>
>> Are there cases where ongoing concurrent lockless page-table walks need
>> to see the writes and they can't tolerate seeing valid but slightly
>> stale data?
>
> The issue is we're about to free the page table (e.g.
> pmdp_collapse_flush()).
>
> We have to ensure no walker is still doing a lockless page-table walk
> when the page directories are freed, otherwise we get use-after-free.
Right, and walking a page table that is suddenly no longer a page table
is the real fun :)
... or trying to look up the page of something that is not even a page.
>
>> Don't forget that we also have pesky concurrent lockless page-table
>> walkers called CPUs. They're extra pesky in that they don't even stop
>> for IPIs. ;)
>
> I assume those walkers that don't disable IRQs only read the PMD and
> don't walk into the table; otherwise the current sync wouldn't work
> for them.
CPU page table walkers are much easier to control in that regard :)
--
Cheers,
David
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 16:30 ` Lance Yang
2026-02-05 16:46 ` David Hildenbrand (Arm)
@ 2026-02-05 16:48 ` Matthew Wilcox
2026-02-05 17:06 ` David Hildenbrand (Arm)
2026-02-05 17:00 ` Dave Hansen
2 siblings, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2026-02-05 16:48 UTC (permalink / raw)
To: Lance Yang
Cc: Dave Hansen, David Hildenbrand (Arm),
Peter Zijlstra, Liam.Howlett, akpm, aneesh.kumar, arnd, baohua,
baolin.wang, boris.ostrovsky, bp, dave.hansen, dev.jain, hpa,
hughd, ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On Fri, Feb 06, 2026 at 12:30:56AM +0800, Lance Yang wrote:
> On 2026/2/5 23:41, Dave Hansen wrote:
> > Yeah, but one aim of RCU is ensuring that readers see valid data but not
> > necessarily the most up to date data.
> >
> > Are there cases where ongoing concurrent lockless page-table walks need
> > to see the writes and they can't tolerate seeing valid but slightly
> > stale data?
>
> The issue is we're about to free the page table (e.g.
> pmdp_collapse_flush()).
>
> We have to ensure no walker is still doing a lockless page-table walk
> when the page directories are freed, otherwise we get use-after-free.
But can't we RCU-free the page table? Why do we need to wait for the
RCU readers to finish?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 16:48 ` Matthew Wilcox
@ 2026-02-05 17:06 ` David Hildenbrand (Arm)
2026-02-05 18:36 ` Dave Hansen
2026-02-05 21:30 ` David Hildenbrand (Arm)
0 siblings, 2 replies; 35+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 17:06 UTC (permalink / raw)
To: Matthew Wilcox, Lance Yang
Cc: Dave Hansen, Peter Zijlstra, Liam.Howlett, akpm, aneesh.kumar,
arnd, baohua, baolin.wang, boris.ostrovsky, bp, dave.hansen,
dev.jain, hpa, hughd, ioworker0, jannh, jgross, kvm, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
pbonzini, riel, ryan.roberts, seanjc, shy828301, tglx,
virtualization, will, x86, ypodemsk, ziy
On 2/5/26 17:48, Matthew Wilcox wrote:
> On Fri, Feb 06, 2026 at 12:30:56AM +0800, Lance Yang wrote:
>> On 2026/2/5 23:41, Dave Hansen wrote:
>>> Yeah, but one aim of RCU is ensuring that readers see valid data but not
>>> necessarily the most up to date data.
>>>
>>> Are there cases where ongoing concurrent lockless page-table walks need
>>> to see the writes and they can't tolerate seeing valid but slightly
>>> stale data?
>>
>> The issue is we're about to free the page table (e.g.
>> pmdp_collapse_flush()).
>>
>> We have to ensure no walker is still doing a lockless page-table walk
>> when the page directories are freed, otherwise we get use-after-free.
>
> But can't we RCU-free the page table? Why do we need to wait for the
> RCU readers to finish?
For unsharing hugetlb PMD tables the problem is not the freeing but the
reuse of the PMD table for other purposes in the last remaining user.
It's complicated.
For page table freeing, we only do it if we fail to allocate memory --
if we cannot use RCU IIRC.
khugepaged, no idea.
--
Cheers,
David
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 17:06 ` David Hildenbrand (Arm)
@ 2026-02-05 18:36 ` Dave Hansen
2026-02-05 22:49 ` David Hildenbrand (Arm)
2026-02-05 21:30 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 35+ messages in thread
From: Dave Hansen @ 2026-02-05 18:36 UTC (permalink / raw)
To: David Hildenbrand (Arm), Matthew Wilcox, Lance Yang
Cc: Peter Zijlstra, Liam.Howlett, akpm, aneesh.kumar, arnd, baohua,
baolin.wang, boris.ostrovsky, bp, dave.hansen, dev.jain, hpa,
hughd, ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2/5/26 09:06, David Hildenbrand (Arm) wrote:
>> But can't we RCU-free the page table? Why do we need to wait for the
>> RCU readers to finish?
>
> For unsharing hugetlb PMD tables the problem is not the freeing but the
> reuse of the PMD table for other purposes in the last remaining user.
> It's complicated.
Letting the previously-shared table get released to everything else in
the system sounds like a fixable problem. tlb_flush_unshared_tables()
talks about this, and it makes sense that once locks get dropped,
something else could get mapped in and start using the PMD.
The RCU way of fixing that would be to allocate a new page table, replace
the old one, and RCU-free the old one. Read, Copy, Update. :)
It does temporarily eat up an extra page, and cost an extra copy. But
neither of those seems expensive compared to IPI'ing the world.
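Shape-wise, something like this (all names here are hypothetical, just to
illustrate the idea; not from any tree):
static int replace_shared_table(struct pt_slot *parent, struct pt *old_pt)
{
	struct pt *new_pt = pt_alloc();		/* may fail */
	if (!new_pt)
		return -ENOMEM;
	pt_copy_entries(new_pt, old_pt);
	rcu_assign_pointer(parent->table, new_pt);	/* publish the copy */
	call_rcu(&old_pt->rcu_head, pt_free_rcu);	/* free old after a grace period */
	return 0;
}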
> For page table freeing, we only do it if we fail to allocate memory --
> if we cannot use RCU IIRC.
But that case is fine to be slow and use synchronize_rcu(). If you're
failing to allocate a single page, you're in a way slow path anyway.
> khugepaged, no idea.
Sounds like Lance has some homework to do there. :)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 18:36 ` Dave Hansen
@ 2026-02-05 22:49 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 22:49 UTC (permalink / raw)
To: Dave Hansen, Matthew Wilcox, Lance Yang
Cc: Peter Zijlstra, Liam.Howlett, akpm, aneesh.kumar, arnd, baohua,
baolin.wang, boris.ostrovsky, bp, dave.hansen, dev.jain, hpa,
hughd, ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2/5/26 19:36, Dave Hansen wrote:
> On 2/5/26 09:06, David Hildenbrand (Arm) wrote:
>>> But can't we RCU-free the page table? Why do we need to wait for the
>>> RCU readers to finish?
>>
>> For unsharing hugetlb PMD tables the problem is not the freeing but the
>> reuse of the PMD table for other purposes in the last remaining user.
>> It's complicated.
>
> Letting the previously-shared table get released to everything else in
> the system sounds like a fixable problem. tlb_flush_unshared_tables()
> talks about this, and it makes sense that once locks get dropped,
> something else could get mapped in and start using the PMD.
Yeah, I tried to document that carefully.
>
> The RCU way of fixing that would be to allocate a new page table, replace
> the old one, and RCU-free the old one. Read, Copy, Update. :)
>
> It does temporarily eat up an extra page, and cost an extra copy. But
> neither of those seems expensive compared to IPI'ing the world.
I played with many such ideas, including never reusing a page table
again once it had been shared. All turned out rather horrible.
The RCU way: replacing a shared page table involves updating all processes
that share the page table :/ . I think another issue I stumbled into
while trying to implement it was failing to allocate memory (but
being required to make progress). It all turned into quite a bit of
complexity and inefficiency, so I had to give up on that. :)
>
>> For page table freeing, we only do it if we fail to allocate memory --
>> if we cannot use RCU IIRC.
>
> But that case is fine to be slow and use synchronize_rcu(). If you're
> failing to allocate a single page, you're in a way slow path anyway.
That's true. We could likely do that already and avoid the IPI broadcast
there that was once reported to be a problem for RT applications.
--
Cheers,
David
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 17:06 ` David Hildenbrand (Arm)
2026-02-05 18:36 ` Dave Hansen
@ 2026-02-05 21:30 ` David Hildenbrand (Arm)
1 sibling, 0 replies; 35+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-05 21:30 UTC (permalink / raw)
To: Matthew Wilcox, Lance Yang
Cc: Dave Hansen, Peter Zijlstra, Liam.Howlett, akpm, aneesh.kumar,
arnd, baohua, baolin.wang, boris.ostrovsky, bp, dave.hansen,
dev.jain, hpa, hughd, ioworker0, jannh, jgross, kvm, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
pbonzini, riel, ryan.roberts, seanjc, shy828301, tglx,
virtualization, will, x86, ypodemsk, ziy
On 2/5/26 18:06, David Hildenbrand (Arm) wrote:
> On 2/5/26 17:48, Matthew Wilcox wrote:
>> On Fri, Feb 06, 2026 at 12:30:56AM +0800, Lance Yang wrote:
>>>
>>> The issue is we're about to free the page table (e.g.
>>> pmdp_collapse_flush()).
>>>
>>> We have to ensure no walker is still doing a lockless page-table walk
>>> when the page directories are freed, otherwise we get use-after-free.
>>
>> But can't we RCU-free the page table? Why do we need to wait for the
>> RCU readers to finish?
>
> For unsharing hugetlb PMD tables the problem is not the freeing but the
> reuse of the PMD table for other purposes in the last remaining user.
> It's complicated.
>
> For page table freeing, we only do it if we fail to allocate memory --
> if we cannot use RCU IIRC.
>
> khugepaged, no idea.
Now that I've had dinner, my memory comes back: for khugepaged, we have to
make sure there is no concurrent GUP-fast before collapsing and
(possibly) freeing the page table / re-depositing it.
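i.e. the collapse path orders things roughly like this (heavily condensed
from collapse_huge_page(), from memory):
	/* locking, mmu notifiers and error handling omitted */
	_pmd = pmdp_collapse_flush(vma, address, pmd);	/* clear PMD + flush TLB */
	tlb_remove_table_sync_one();	/* wait out any concurrent GUP-fast */
	/* only now is it safe to collapse and free / re-deposit the PTE table */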
--
Cheers,
David
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH v4 0/3] targeted TLB sync IPIs for lockless page table
2026-02-05 16:30 ` Lance Yang
2026-02-05 16:46 ` David Hildenbrand (Arm)
2026-02-05 16:48 ` Matthew Wilcox
@ 2026-02-05 17:00 ` Dave Hansen
2 siblings, 0 replies; 35+ messages in thread
From: Dave Hansen @ 2026-02-05 17:00 UTC (permalink / raw)
To: Lance Yang, David Hildenbrand (Arm), Peter Zijlstra
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang,
boris.ostrovsky, bp, dave.hansen, dev.jain, hpa, hughd,
ioworker0, jannh, jgross, kvm, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, pbonzini,
riel, ryan.roberts, seanjc, shy828301, tglx, virtualization,
will, x86, ypodemsk, ziy
On 2/5/26 08:30, Lance Yang wrote:
...
>> Are there cases where ongoing concurrent lockless page-table walks need
>> to see the writes and they can't tolerate seeing valid but slightly
>> stale data?
>
> The issue is we're about to free the page table (e.g.
> pmdp_collapse_flush()).
>
> We have to ensure no walker is still doing a lockless page-table walk
> when the page directories are freed, otherwise we get use-after-free.
But isn't this already solved by the existing RCU freeing approach
documented above tlb_remove_table_smp_sync()?
This seems like a rather classic way to use RCU: wait to free until RCU
says there can't be a reader any more. You don't have to sit there and
wait for it, you just use call_rcu() which will hold off the free until
it's safe.
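Roughly what the PT_RECLAIM free path already does today (simplified, from
memory, so names/fields may be slightly off):
static void tlb_remove_table_rcu_cb(struct rcu_head *head)
{
	struct ptdesc *ptdesc = container_of(head, struct ptdesc, pt_rcu_head);
	/* ... actually free the page-table page backing @ptdesc ... */
}
static void __tlb_remove_table_one(void *table)
{
	struct ptdesc *ptdesc = table;
	/* No IPI and no waiting: the free is deferred past the grace period. */
	call_rcu(&ptdesc->pt_rcu_head, tlb_remove_table_rcu_cb);
}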
^ permalink raw reply [flat|nested] 35+ messages in thread