* [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
@ 2026-02-24 3:07 Lance Yang
2026-02-24 11:04 ` David Hildenbrand (Arm)
2026-02-24 11:41 ` Peter Zijlstra
0 siblings, 2 replies; 9+ messages in thread
From: Lance Yang @ 2026-02-24 3:07 UTC (permalink / raw)
To: akpm, peterz
Cc: david, dave.hansen, will, aneesh.kumar, npiggin, linux-arch,
linux-mm, linux-kernel, Lance Yang
From: Lance Yang <lance.yang@linux.dev>
When freeing page tables, we try to batch them. If batch allocation fails
(GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without
batching.
On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
process is unmapping memory. IPI broadcast was reported to hurt RT
workloads[1].
tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
(e.g. GUP-fast) that rely on IRQ disabling. These walkers use
local_irq_disable(), which is also an RCU read-side critical section.
This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace
period (synchronize_rcu()) instead of IPI broadcast. This provides the
same guarantee as IPI but without disrupting all CPUs. Since batch
allocation already failed, we are in a slow path where sleeping is
acceptable - we are in process context (unmap_region, exit_mmap) with only
mmap_lock held. might_sleep() will catch any invalid context.
[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
Link: https://lore.kernel.org/linux-mm/20260202150957.GD1282955@noisy.programming.kicks-ass.net/
Link: https://lore.kernel.org/linux-mm/dfdfeac9-5cd5-46fc-a5c1-9ccf9bd3502a@intel.com/
Link: https://lore.kernel.org/linux-mm/bc489455-bb18-44dc-8518-ae75abda6bec@kernel.org/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
v1 -> v2:
- Wrap synchronize_rcu() in tlb_remove_table_sync_rcu() with proper
kerneldoc (per David)
- Add might_sleep() to make sleeping constraint explicit (per Dave)
- Clarify this is for synchronization, not memory freeing (per Dave)
- https://lore.kernel.org/linux-mm/20260223033604.10198-1-lance.yang@linux.dev/
include/asm-generic/tlb.h | 4 ++++
mm/mmu_gather.c | 22 +++++++++++++++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 4aeac0c3d3f0..bdcc2778ac64 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -251,6 +251,8 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
void tlb_remove_table_sync_one(void);
+void tlb_remove_table_sync_rcu(void);
+
#else
#ifdef tlb_needs_table_invalidate
@@ -259,6 +261,8 @@ void tlb_remove_table_sync_one(void);
static inline void tlb_remove_table_sync_one(void) { }
+static inline void tlb_remove_table_sync_rcu(void) { }
+
#endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index fe5b6a031717..2c6fa8db55df 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -296,6 +296,26 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
call_rcu(&batch->rcu, tlb_remove_table_rcu);
}
+/**
+ * tlb_remove_table_sync_rcu() - synchronize with software page-table walkers
+ *
+ * Like tlb_remove_table_sync_one() but uses RCU grace period instead of IPI
+ * broadcast. Use in slow paths where sleeping is acceptable.
+ *
+ * Software/Lockless page-table walkers use local_irq_disable(), which is also
+ * an RCU read-side critical section. synchronize_rcu() waits for all such
+ * sections, providing the same guarantee as tlb_remove_table_sync_one() but
+ * without disrupting all CPUs with IPIs.
+ *
+ * Do not use for freeing memory. Use RCU callbacks instead to avoid latency
+ * spikes. Cannot be called from any atomic context.
+ */
+void tlb_remove_table_sync_rcu(void)
+{
+ might_sleep();
+ synchronize_rcu();
+}
+
#else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */
static void tlb_remove_table_free(struct mmu_table_batch *batch)
@@ -339,7 +359,7 @@ static inline void __tlb_remove_table_one(void *table)
#else
static inline void __tlb_remove_table_one(void *table)
{
- tlb_remove_table_sync_one();
+ tlb_remove_table_sync_rcu();
__tlb_remove_table(table);
}
#endif /* CONFIG_PT_RECLAIM */
--
2.49.0
* Re: [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
2026-02-24 3:07 [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails Lance Yang
@ 2026-02-24 11:04 ` David Hildenbrand (Arm)
2026-02-24 11:32 ` Lance Yang
2026-02-24 11:41 ` Peter Zijlstra
1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand (Arm) @ 2026-02-24 11:04 UTC (permalink / raw)
To: Lance Yang, akpm, peterz
Cc: dave.hansen, will, aneesh.kumar, npiggin, linux-arch, linux-mm,
linux-kernel
On 2/24/26 04:07, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> When freeing page tables, we try to batch them. If batch allocation fails
> (GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without
> batching.
>
> On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
> tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
> process is unmapping memory. IPI broadcast was reported to hurt RT
> workloads[1].
>
> tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
> (e.g. GUP-fast) that rely on IRQ disabling. These walkers use
> local_irq_disable(), which is also an RCU read-side critical section.
>
> This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace
> period (synchronize_rcu()) instead of IPI broadcast. This provides the
> same guarantee as IPI but without disrupting all CPUs. Since batch
> allocation already failed, we are in a slow path where sleeping is
> acceptable - we are in process context (unmap_region, exit_mmap) with only
> mmap_lock held. might_sleep() will catch any invalid context.
>
> [1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
>
> Link: https://lore.kernel.org/linux-mm/20260202150957.GD1282955@noisy.programming.kicks-ass.net/
> Link: https://lore.kernel.org/linux-mm/dfdfeac9-5cd5-46fc-a5c1-9ccf9bd3502a@intel.com/
> Link: https://lore.kernel.org/linux-mm/bc489455-bb18-44dc-8518-ae75abda6bec@kernel.org/
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> v1 -> v2:
> - Wrap synchronize_rcu() in tlb_remove_table_sync_rcu() with proper
> kerneldoc (per David)
> - Add might_sleep() to make sleeping constraint explicit (per Dave)
> - Clarify this is for synchronization, not memory freeing (per Dave)
> - https://lore.kernel.org/linux-mm/20260223033604.10198-1-lance.yang@linux.dev/
>
> include/asm-generic/tlb.h | 4 ++++
> mm/mmu_gather.c | 22 +++++++++++++++++++++-
> 2 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 4aeac0c3d3f0..bdcc2778ac64 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -251,6 +251,8 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
>
> void tlb_remove_table_sync_one(void);
>
> +void tlb_remove_table_sync_rcu(void);
> +
> #else
>
> #ifdef tlb_needs_table_invalidate
> @@ -259,6 +261,8 @@ void tlb_remove_table_sync_one(void);
>
> static inline void tlb_remove_table_sync_one(void) { }
>
> +static inline void tlb_remove_table_sync_rcu(void) { }
> +
> #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
>
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index fe5b6a031717..2c6fa8db55df 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -296,6 +296,26 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
> call_rcu(&batch->rcu, tlb_remove_table_rcu);
> }
>
> +/**
> + * tlb_remove_table_sync_rcu() - synchronize with software page-table walkers
Nit: no need for the "()"
Thanks!
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
* Re: [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
2026-02-24 11:04 ` David Hildenbrand (Arm)
@ 2026-02-24 11:32 ` Lance Yang
0 siblings, 0 replies; 9+ messages in thread
From: Lance Yang @ 2026-02-24 11:32 UTC (permalink / raw)
To: David Hildenbrand (Arm), akpm, peterz
Cc: dave.hansen, will, aneesh.kumar, npiggin, linux-arch, linux-mm,
linux-kernel
On 2026/2/24 19:04, David Hildenbrand (Arm) wrote:
> On 2/24/26 04:07, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> When freeing page tables, we try to batch them. If batch allocation fails
>> (GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without
>> batching.
>>
>> On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
>> tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
>> process is unmapping memory. IPI broadcast was reported to hurt RT
>> workloads[1].
>>
>> tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
>> (e.g. GUP-fast) that rely on IRQ disabling. These walkers use
>> local_irq_disable(), which is also an RCU read-side critical section.
>>
>> This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace
>> period (synchronize_rcu()) instead of IPI broadcast. This provides the
>> same guarantee as IPI but without disrupting all CPUs. Since batch
>> allocation already failed, we are in a slow path where sleeping is
>> acceptable - we are in process context (unmap_region, exit_mmap) with only
>> mmap_lock held. might_sleep() will catch any invalid context.
>>
>> [1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
>>
>> Link: https://lore.kernel.org/linux-mm/20260202150957.GD1282955@noisy.programming.kicks-ass.net/
>> Link: https://lore.kernel.org/linux-mm/dfdfeac9-5cd5-46fc-a5c1-9ccf9bd3502a@intel.com/
>> Link: https://lore.kernel.org/linux-mm/bc489455-bb18-44dc-8518-ae75abda6bec@kernel.org/
>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>> Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> v1 -> v2:
>> - Wrap synchronize_rcu() in tlb_remove_table_sync_rcu() with proper
>> kerneldoc (per David)
>> - Add might_sleep() to make sleeping constraint explicit (per Dave)
>> - Clarify this is for synchronization, not memory freeing (per Dave)
>> - https://lore.kernel.org/linux-mm/20260223033604.10198-1-lance.yang@linux.dev/
>>
>> include/asm-generic/tlb.h | 4 ++++
>> mm/mmu_gather.c | 22 +++++++++++++++++++++-
>> 2 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 4aeac0c3d3f0..bdcc2778ac64 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -251,6 +251,8 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
>>
>> void tlb_remove_table_sync_one(void);
>>
>> +void tlb_remove_table_sync_rcu(void);
>> +
>> #else
>>
>> #ifdef tlb_needs_table_invalidate
>> @@ -259,6 +261,8 @@ void tlb_remove_table_sync_one(void);
>>
>> static inline void tlb_remove_table_sync_one(void) { }
>>
>> +static inline void tlb_remove_table_sync_rcu(void) { }
>> +
>> #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
>>
>>
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index fe5b6a031717..2c6fa8db55df 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -296,6 +296,26 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
>> call_rcu(&batch->rcu, tlb_remove_table_rcu);
>> }
>>
>> +/**
>> + * tlb_remove_table_sync_rcu() - synchronize with software page-table walkers
>
> Nit: no need for the "()"
Oops, @Andrew could you drop the "()" here when applying?
>
> Thanks!
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Thanks for taking the time to review!
* Re: [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
2026-02-24 3:07 [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails Lance Yang
2026-02-24 11:04 ` David Hildenbrand (Arm)
@ 2026-02-24 11:41 ` Peter Zijlstra
2026-02-24 11:55 ` Peter Zijlstra
` (2 more replies)
1 sibling, 3 replies; 9+ messages in thread
From: Peter Zijlstra @ 2026-02-24 11:41 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, dave.hansen, will, aneesh.kumar, npiggin,
linux-arch, linux-mm, linux-kernel
On Tue, Feb 24, 2026 at 11:07:00AM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> When freeing page tables, we try to batch them. If batch allocation fails
> (GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without
> batching.
>
> On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
> tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
> process is unmapping memory. IPI broadcast was reported to hurt RT
> workloads[1].
>
> tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
> (e.g. GUP-fast) that rely on IRQ disabling. These walkers use
> local_irq_disable(), which is also an RCU read-side critical section.
>
> This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace
> period (synchronize_rcu()) instead of IPI broadcast. This provides the
> same guarantee as IPI but without disrupting all CPUs. Since batch
> allocation already failed, we are in a slow path where sleeping is
> acceptable - we are in process context (unmap_region, exit_mmap) with only
> mmap_lock held. might_sleep() will catch any invalid context.
So sending the IPIs also requires non-atomic context, so change there.
What isn't explained, and very much not clear to me, is why
tlb_remove_table_sync_one() is retained?
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 4aeac0c3d3f0..bdcc2778ac64 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -251,6 +251,8 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
>
> void tlb_remove_table_sync_one(void);
>
> +void tlb_remove_table_sync_rcu(void);
> +
> #else
>
> #ifdef tlb_needs_table_invalidate
> @@ -259,6 +261,8 @@ void tlb_remove_table_sync_one(void);
>
> static inline void tlb_remove_table_sync_one(void) { }
>
> +static inline void tlb_remove_table_sync_rcu(void) { }
> +
> #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
>
>
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index fe5b6a031717..2c6fa8db55df 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -296,6 +296,26 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
> call_rcu(&batch->rcu, tlb_remove_table_rcu);
> }
>
> +/**
> + * tlb_remove_table_sync_rcu() - synchronize with software page-table walkers
> + *
> + * Like tlb_remove_table_sync_one() but uses RCU grace period instead of IPI
> + * broadcast. Use in slow paths where sleeping is acceptable.
> + *
> + * Software/Lockless page-table walkers use local_irq_disable(), which is also
> + * an RCU read-side critical section. synchronize_rcu() waits for all such
> + * sections, providing the same guarantee as tlb_remove_table_sync_one() but
> + * without disrupting all CPUs with IPIs.
> + *
> + * Do not use for freeing memory. Use RCU callbacks instead to avoid latency
> + * spikes. Cannot be called from any atomic context.
> + */
> +void tlb_remove_table_sync_rcu(void)
> +{
> + might_sleep();
> + synchronize_rcu();
synchronize_rcu() should end up in a might_sleep() at some point if it
blocks (which it typically will).
> +}
> +
> #else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */
>
> static void tlb_remove_table_free(struct mmu_table_batch *batch)
> @@ -339,7 +359,7 @@ static inline void __tlb_remove_table_one(void *table)
> #else
> static inline void __tlb_remove_table_one(void *table)
> {
> - tlb_remove_table_sync_one();
> + tlb_remove_table_sync_rcu();
> __tlb_remove_table(table);
> }
> #endif /* CONFIG_PT_RECLAIM */
> --
> 2.49.0
>
* Re: [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
2026-02-24 11:41 ` Peter Zijlstra
@ 2026-02-24 11:55 ` Peter Zijlstra
2026-02-24 12:18 ` Lance Yang
2026-02-24 15:04 ` Dave Hansen
2 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2026-02-24 11:55 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, dave.hansen, will, aneesh.kumar, npiggin,
linux-arch, linux-mm, linux-kernel
On Tue, Feb 24, 2026 at 12:41:52PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 24, 2026 at 11:07:00AM +0800, Lance Yang wrote:
> > From: Lance Yang <lance.yang@linux.dev>
> >
> > When freeing page tables, we try to batch them. If batch allocation fails
> > (GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without
> > batching.
> >
> > On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
> > tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
> > process is unmapping memory. IPI broadcast was reported to hurt RT
> > workloads[1].
> >
> > tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
> > (e.g. GUP-fast) that rely on IRQ disabling. These walkers use
> > local_irq_disable(), which is also an RCU read-side critical section.
> >
> > This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace
> > period (synchronize_rcu()) instead of IPI broadcast. This provides the
> > same guarantee as IPI but without disrupting all CPUs. Since batch
> > allocation already failed, we are in a slow path where sleeping is
> > acceptable - we are in process context (unmap_region, exit_mmap) with only
> > mmap_lock held. might_sleep() will catch any invalid context.
>
> So sending the IPIs also requires non-atomic context, so change there.
s/so/no/ -- typing hard
* Re: [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
2026-02-24 11:41 ` Peter Zijlstra
2026-02-24 11:55 ` Peter Zijlstra
@ 2026-02-24 12:18 ` Lance Yang
2026-02-24 12:35 ` Peter Zijlstra
2026-02-24 15:04 ` Dave Hansen
2 siblings, 1 reply; 9+ messages in thread
From: Lance Yang @ 2026-02-24 12:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, david, dave.hansen, will, aneesh.kumar, npiggin,
linux-arch, linux-mm, linux-kernel
On 2026/2/24 19:41, Peter Zijlstra wrote:
> On Tue, Feb 24, 2026 at 11:07:00AM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> When freeing page tables, we try to batch them. If batch allocation fails
>> (GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without
>> batching.
>>
>> On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
>> tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
>> process is unmapping memory. IPI broadcast was reported to hurt RT
>> workloads[1].
>>
>> tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
>> (e.g. GUP-fast) that rely on IRQ disabling. These walkers use
>> local_irq_disable(), which is also an RCU read-side critical section.
>>
>> This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace
>> period (synchronize_rcu()) instead of IPI broadcast. This provides the
>> same guarantee as IPI but without disrupting all CPUs. Since batch
>> allocation already failed, we are in a slow path where sleeping is
>> acceptable - we are in process context (unmap_region, exit_mmap) with only
>> mmap_lock held. might_sleep() will catch any invalid context.
>
> So sending the IPIs also requires non-atomic context, so change there.
Yeah, you're right!
> What isn't explained, and very much not clear to me, is why
> tlb_remove_table_sync_one() is retained?
Good point. tlb_remove_table_sync_one() is still needed in:
1) khugepaged (mm/khugepaged.c) - after pmdp_collapse_flush()
2) tlb_finish_mmu() (tlb.h) - when tlb->fully_unshared_tables
3) ...
These are not slow paths like batch allocation failure. This patch only
converts this obvious slow path first.
I'm working on converting the remaining callers as well, though not with
RCU; I'm looking at other options (e.g. a targeted IPI).
>
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 4aeac0c3d3f0..bdcc2778ac64 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -251,6 +251,8 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
>>
>> void tlb_remove_table_sync_one(void);
>>
>> +void tlb_remove_table_sync_rcu(void);
>> +
>> #else
>>
>> #ifdef tlb_needs_table_invalidate
>> @@ -259,6 +261,8 @@ void tlb_remove_table_sync_one(void);
>>
>> static inline void tlb_remove_table_sync_one(void) { }
>>
>> +static inline void tlb_remove_table_sync_rcu(void) { }
>> +
>> #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
>>
>>
>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
>> index fe5b6a031717..2c6fa8db55df 100644
>> --- a/mm/mmu_gather.c
>> +++ b/mm/mmu_gather.c
>> @@ -296,6 +296,26 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
>> call_rcu(&batch->rcu, tlb_remove_table_rcu);
>> }
>>
>> +/**
>> + * tlb_remove_table_sync_rcu() - synchronize with software page-table walkers
>> + *
>> + * Like tlb_remove_table_sync_one() but uses RCU grace period instead of IPI
>> + * broadcast. Use in slow paths where sleeping is acceptable.
>> + *
>> + * Software/Lockless page-table walkers use local_irq_disable(), which is also
>> + * an RCU read-side critical section. synchronize_rcu() waits for all such
>> + * sections, providing the same guarantee as tlb_remove_table_sync_one() but
>> + * without disrupting all CPUs with IPIs.
>> + *
>> + * Do not use for freeing memory. Use RCU callbacks instead to avoid latency
>> + * spikes. Cannot be called from any atomic context.
>> + */
>> +void tlb_remove_table_sync_rcu(void)
>> +{
>> + might_sleep();
>> + synchronize_rcu();
>
> synchronize_rcu() should end up in a might_sleep() at some point if it
> blocks (which it typically will).
Right, will drop the explicit might_sleep() and "Cannot be called from
any atomic context" from kerneldoc since both have the same requirements.
Thanks,
Lance
* Re: [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
2026-02-24 12:18 ` Lance Yang
@ 2026-02-24 12:35 ` Peter Zijlstra
2026-02-24 12:57 ` Lance Yang
0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2026-02-24 12:35 UTC (permalink / raw)
To: Lance Yang
Cc: akpm, david, dave.hansen, will, aneesh.kumar, npiggin,
linux-arch, linux-mm, linux-kernel
On Tue, Feb 24, 2026 at 08:18:46PM +0800, Lance Yang wrote:
>
>
> On 2026/2/24 19:41, Peter Zijlstra wrote:
> > On Tue, Feb 24, 2026 at 11:07:00AM +0800, Lance Yang wrote:
> > > From: Lance Yang <lance.yang@linux.dev>
> > >
> > > When freeing page tables, we try to batch them. If batch allocation fails
> > > (GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without
> > > batching.
> > >
> > > On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
> > > tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
> > > process is unmapping memory. IPI broadcast was reported to hurt RT
> > > workloads[1].
> > >
> > > tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
> > > (e.g. GUP-fast) that rely on IRQ disabling. These walkers use
> > > local_irq_disable(), which is also an RCU read-side critical section.
> > >
> > > This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace
> > > period (synchronize_rcu()) instead of IPI broadcast. This provides the
> > > same guarantee as IPI but without disrupting all CPUs. Since batch
> > > allocation already failed, we are in a slow path where sleeping is
> > > acceptable - we are in process context (unmap_region, exit_mmap) with only
> > > mmap_lock held. might_sleep() will catch any invalid context.
> >
> > So sending the IPIs also requires non-atomic context, so change there.
>
> Yeah, you're right!
>
> > What isn't explained, and very much not clear to me, is why
> > tlb_remove_table_sync_one() is retained?
>
> Good point. tlb_remove_table_sync_one() is still needed in:
>
> 1) khugepaged (mm/khugepaged.c) - after pmdp_collapse_flush()
> 2) tlb_finish_mmu() (tlb.h) - when tlb->fully_unshared_tables
> 3) ...
>
> These are not slow paths like batch allocation failure. This patch only
> converts this obvious slow path first.
>
> I'm working on converting the remaining callers as well, though not with
> RCU; I'm looking at other options (e.g. a targeted IPI).
OK, so with that addition to the Changelog,
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
* Re: [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
2026-02-24 12:35 ` Peter Zijlstra
@ 2026-02-24 12:57 ` Lance Yang
0 siblings, 0 replies; 9+ messages in thread
From: Lance Yang @ 2026-02-24 12:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: akpm, david, dave.hansen, will, aneesh.kumar, npiggin,
linux-arch, linux-mm, linux-kernel
On 2026/2/24 20:35, Peter Zijlstra wrote:
> On Tue, Feb 24, 2026 at 08:18:46PM +0800, Lance Yang wrote:
>>
>>
>> On 2026/2/24 19:41, Peter Zijlstra wrote:
>>> On Tue, Feb 24, 2026 at 11:07:00AM +0800, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> When freeing page tables, we try to batch them. If batch allocation fails
>>>> (GFP_NOWAIT), __tlb_remove_table_one() immediately frees the one without
>>>> batching.
>>>>
>>>> On !CONFIG_PT_RECLAIM, the fallback sends an IPI to all CPUs via
>>>> tlb_remove_table_sync_one(). It disrupts all CPUs even when only a single
>>>> process is unmapping memory. IPI broadcast was reported to hurt RT
>>>> workloads[1].
>>>>
>>>> tlb_remove_table_sync_one() synchronizes with lockless page-table walkers
>>>> (e.g. GUP-fast) that rely on IRQ disabling. These walkers use
>>>> local_irq_disable(), which is also an RCU read-side critical section.
>>>>
>>>> This patch introduces tlb_remove_table_sync_rcu() which uses RCU grace
>>>> period (synchronize_rcu()) instead of IPI broadcast. This provides the
>>>> same guarantee as IPI but without disrupting all CPUs. Since batch
>>>> allocation already failed, we are in a slow path where sleeping is
>>>> acceptable - we are in process context (unmap_region, exit_mmap) with only
>>>> mmap_lock held. might_sleep() will catch any invalid context.
>>>
>>> So sending the IPIs also requires non-atomic context, so change there.
>>
>> Yeah, you're right!
>>
>>> What isn't explained, and very much not clear to me, is why
>>> tlb_remove_table_sync_one() is retained?
>>
>> Good point. tlb_remove_table_sync_one() is still needed in:
>>
>> 1) khugepaged (mm/khugepaged.c) - after pmdp_collapse_flush()
>> 2) tlb_finish_mmu() (tlb.h) - when tlb->fully_unshared_tables
>> 3) ...
>>
>> These are not slow paths like batch allocation failure. This patch only
>> converts this obvious slow path first.
>>
>> I'm working on converting the remaining callers as well, but not with
>> RCU, looking at other options (e.g. targeted IPI).
>
> OK, so with that addition to the Changelog,
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Thanks for taking the time to review!
* Re: [PATCH v2 1/1] mm/mmu_gather: replace IPI with synchronize_rcu() when batch allocation fails
2026-02-24 11:41 ` Peter Zijlstra
2026-02-24 11:55 ` Peter Zijlstra
2026-02-24 12:18 ` Lance Yang
@ 2026-02-24 15:04 ` Dave Hansen
2 siblings, 0 replies; 9+ messages in thread
From: Dave Hansen @ 2026-02-24 15:04 UTC (permalink / raw)
To: Peter Zijlstra, Lance Yang
Cc: akpm, david, will, aneesh.kumar, npiggin, linux-arch, linux-mm,
linux-kernel
On 2/24/26 03:41, Peter Zijlstra wrote:
>> +void tlb_remove_table_sync_rcu(void)
>> +{
>> + might_sleep();
>> + synchronize_rcu();
> synchronize_rcu() should end up in a might_sleep() at some point if it
> blocks (which it typically will).
FWIW, I do prefer the explicit might_sleep() rather than leaving it to
just the documentation. It just makes it easier to find bugs. I'm sure
there's some crazy RCU variant that doesn't often sleep in
synchronize_rcu(). ;)
If it's worth adding a line of comment, it's worth adding a line of code
to actually keep folks honest. This is also going to be a pretty darn
slow path so it shouldn't bloat anything too much.