* [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency
2025-12-18 1:45 [RFC PATCH v4 0/4] Hazard Pointers Mathieu Desnoyers
@ 2025-12-18 1:45 ` Mathieu Desnoyers
2025-12-18 9:03 ` David Laight
2025-12-18 1:45 ` [RFC PATCH v4 2/4] Documentation: RCU: Refer to ptr_eq() Mathieu Desnoyers
` (3 subsequent siblings)
4 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 1:45 UTC (permalink / raw)
To: Boqun Feng, Joel Fernandes, Paul E. McKenney
Cc: linux-kernel, Mathieu Desnoyers, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm, Gary Guo,
Nikita Popov, llvm
Compiler CSE and SSA GVN optimizations can cause the address dependency
of addresses returned by rcu_dereference to be lost when comparing those
pointers with either constants or previously loaded pointers.
Introduce ptr_eq() to compare two addresses while preserving the address
dependencies for later use of the address. It should be used when
comparing an address returned by rcu_dereference().
This is needed to prevent the compiler CSE and SSA GVN optimizations
from using @a (or @b) in places where the source refers to @b (or @a)
based on the fact that after the comparison, the two are known to be
equal, which does not preserve address dependencies and allows the
following misordering speculations:
- If @b is a constant, the compiler can issue the loads which depend
on @a before loading @a.
- If @b is a register populated by a prior load, weakly-ordered
CPUs can speculate loads which depend on @a before loading @a.
The same logic applies with @a and @b swapped.
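For example (a sketch; gp, struct foo and do_default() are illustrative
names mirroring the rcu_dereference() documentation), comparing a pointer
returned by rcu_dereference() against a previously loaded pointer should
use ptr_eq() so that later accesses keep their address dependency on the
rcu_dereference() result:

	struct foo *p1, *p2;

	p1 = READ_ONCE(gp);
	p2 = rcu_dereference(gp);
	if (ptr_eq(p1, p2))
		do_default(p2->a);	/* dependency on p2 is preserved */

With a plain "p1 == p2" comparison, the compiler may use p1->a instead,
dropping the address dependency.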
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: "Paul E. McKenney" <paulmck@kernel.org>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
Cc: Nikita Popov <github@npopov.com>
Cc: llvm@lists.linux.dev
---
Changes since v0:
- Include feedback from Alan Stern.
---
include/linux/compiler.h | 63 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 5b45ea7dff3e..c5ca3b54c112 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -163,6 +163,69 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
__asm__ ("" : "=r" (var) : "0" (var))
#endif
+/*
+ * Compare two addresses while preserving the address dependencies for
+ * later use of the address. It should be used when comparing an address
+ * returned by rcu_dereference().
+ *
+ * This is needed to prevent the compiler CSE and SSA GVN optimizations
+ * from using @a (or @b) in places where the source refers to @b (or @a)
+ * based on the fact that after the comparison, the two are known to be
+ * equal, which does not preserve address dependencies and allows the
+ * following misordering speculations:
+ *
+ * - If @b is a constant, the compiler can issue the loads which depend
+ * on @a before loading @a.
+ * - If @b is a register populated by a prior load, weakly-ordered
+ * CPUs can speculate loads which depend on @a before loading @a.
+ *
+ * The same logic applies with @a and @b swapped.
+ *
+ * Return value: true if pointers are equal, false otherwise.
+ *
+ * The compiler barrier() is ineffective at fixing this issue. It does
+ * not prevent the compiler CSE from losing the address dependency:
+ *
+ * int fct_2_volatile_barriers(void)
+ * {
+ * int *a, *b;
+ *
+ * do {
+ * a = READ_ONCE(p);
+ * asm volatile ("" : : : "memory");
+ * b = READ_ONCE(p);
+ * } while (a != b);
+ * asm volatile ("" : : : "memory"); <-- barrier()
+ * return *b;
+ * }
+ *
+ * With gcc 14.2 (arm64):
+ *
+ * fct_2_volatile_barriers:
+ * adrp x0, .LANCHOR0
+ * add x0, x0, :lo12:.LANCHOR0
+ * .L2:
+ * ldr x1, [x0] <-- x1 populated by first load.
+ * ldr x2, [x0]
+ * cmp x1, x2
+ * bne .L2
+ * ldr w0, [x1] <-- x1 is used for access which should depend on b.
+ * ret
+ *
+ * On weakly-ordered architectures, this lets CPU speculation use the
+ * result from the first load to speculate "ldr w0, [x1]" before
+ * "ldr x2, [x0]".
+ * Based on the RCU documentation, the control dependency does not
+ * prevent the CPU from speculating loads.
+ */
+static __always_inline
+int ptr_eq(const volatile void *a, const volatile void *b)
+{
+ OPTIMIZER_HIDE_VAR(a);
+ OPTIMIZER_HIDE_VAR(b);
+ return a == b;
+}
+
#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
/**
--
2.39.5
* Re: [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency
2025-12-18 1:45 ` [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency Mathieu Desnoyers
@ 2025-12-18 9:03 ` David Laight
2025-12-18 13:51 ` Mathieu Desnoyers
2025-12-18 14:27 ` Gary Guo
0 siblings, 2 replies; 29+ messages in thread
From: David Laight @ 2025-12-18 9:03 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, Joel Fernandes, Paul E. McKenney, linux-kernel,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Will Deacon, Peter Zijlstra,
Alan Stern, John Stultz, Neeraj Upadhyay, Linus Torvalds,
Andrew Morton, Frederic Weisbecker, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm, Gary Guo, Nikita Popov, llvm
On Wed, 17 Dec 2025 20:45:28 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> Compiler CSE and SSA GVN optimizations can cause the address dependency
> of addresses returned by rcu_dereference to be lost when comparing those
> pointers with either constants or previously loaded pointers.
>
> Introduce ptr_eq() to compare two addresses while preserving the address
> dependencies for later use of the address. It should be used when
> comparing an address returned by rcu_dereference().
>
> This is needed to prevent the compiler CSE and SSA GVN optimizations
> from using @a (or @b) in places where the source refers to @b (or @a)
> based on the fact that after the comparison, the two are known to be
> equal, which does not preserve address dependencies and allows the
> following misordering speculations:
>
> - If @b is a constant, the compiler can issue the loads which depend
> on @a before loading @a.
> - If @b is a register populated by a prior load, weakly-ordered
> CPUs can speculate loads which depend on @a before loading @a.
>
> The same logic applies with @a and @b swapped.
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Suggested-by: Boqun Feng <boqun.feng@gmail.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Acked-by: "Paul E. McKenney" <paulmck@kernel.org>
> Acked-by: Alan Stern <stern@rowland.harvard.edu>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: John Stultz <jstultz@google.com>
> Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Zqiang <qiang.zhang1211@gmail.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: maged.michael@gmail.com
> Cc: Mateusz Guzik <mjguzik@gmail.com>
> Cc: Gary Guo <gary@garyguo.net>
> Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> Cc: rcu@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: lkmm@lists.linux.dev
> Cc: Nikita Popov <github@npopov.com>
> Cc: llvm@lists.linux.dev
> ---
> Changes since v0:
> - Include feedback from Alan Stern.
> ---
> include/linux/compiler.h | 63 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 63 insertions(+)
>
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 5b45ea7dff3e..c5ca3b54c112 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -163,6 +163,69 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
> __asm__ ("" : "=r" (var) : "0" (var))
> #endif
>
> +/*
> + * Compare two addresses while preserving the address dependencies for
> + * later use of the address. It should be used when comparing an address
> + * returned by rcu_dereference().
> + *
> + * This is needed to prevent the compiler CSE and SSA GVN optimizations
> + * from using @a (or @b) in places where the source refers to @b (or @a)
> + * based on the fact that after the comparison, the two are known to be
> + * equal, which does not preserve address dependencies and allows the
> + * following misordering speculations:
> + *
> + * - If @b is a constant, the compiler can issue the loads which depend
> + * on @a before loading @a.
> + * - If @b is a register populated by a prior load, weakly-ordered
> + * CPUs can speculate loads which depend on @a before loading @a.
> + *
> + * The same logic applies with @a and @b swapped.
> + *
> + * Return value: true if pointers are equal, false otherwise.
> + *
> + * The compiler barrier() is ineffective at fixing this issue. It does
> + * not prevent the compiler CSE from losing the address dependency:
> + *
> + * int fct_2_volatile_barriers(void)
> + * {
> + * int *a, *b;
> + *
> + * do {
> + * a = READ_ONCE(p);
> + * asm volatile ("" : : : "memory");
> + * b = READ_ONCE(p);
> + * } while (a != b);
> + * asm volatile ("" : : : "memory"); <-- barrier()
> + * return *b;
> + * }
> + *
> + * With gcc 14.2 (arm64):
> + *
> + * fct_2_volatile_barriers:
> + * adrp x0, .LANCHOR0
> + * add x0, x0, :lo12:.LANCHOR0
> + * .L2:
> + * ldr x1, [x0] <-- x1 populated by first load.
> + * ldr x2, [x0]
> + * cmp x1, x2
> + * bne .L2
> + * ldr w0, [x1] <-- x1 is used for access which should depend on b.
> + * ret
> + *
> + * On weakly-ordered architectures, this lets CPU speculation use the
> + * result from the first load to speculate "ldr w0, [x1]" before
> + * "ldr x2, [x0]".
> + * Based on the RCU documentation, the control dependency does not
> + * prevent the CPU from speculating loads.
I'm not sure that example (of something that doesn't work) is really necessary.
The simple example of, given:
return a == b ? *a : 0;
the generated code might speculatively dereference 'b' (not a) before returning
zero when the pointers are different.
David
> + */
> +static __always_inline
> +int ptr_eq(const volatile void *a, const volatile void *b)
> +{
> + OPTIMIZER_HIDE_VAR(a);
> + OPTIMIZER_HIDE_VAR(b);
> + return a == b;
> +}
> +
> #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
>
> /**
* Re: [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency
2025-12-18 9:03 ` David Laight
@ 2025-12-18 13:51 ` Mathieu Desnoyers
2025-12-18 15:54 ` David Laight
2025-12-18 14:27 ` Gary Guo
1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 13:51 UTC (permalink / raw)
To: David Laight
Cc: Boqun Feng, Joel Fernandes, Paul E. McKenney, linux-kernel,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Will Deacon, Peter Zijlstra,
Alan Stern, John Stultz, Neeraj Upadhyay, Linus Torvalds,
Andrew Morton, Frederic Weisbecker, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm, Gary Guo, Nikita Popov, llvm
On 2025-12-18 04:03, David Laight wrote:
[...]
>> + *
>> + * The compiler barrier() is ineffective at fixing this issue. It does
>> + * not prevent the compiler CSE from losing the address dependency:
>> + *
>> + * int fct_2_volatile_barriers(void)
>> + * {
>> + * int *a, *b;
>> + *
>> + * do {
>> + * a = READ_ONCE(p);
>> + * asm volatile ("" : : : "memory");
>> + * b = READ_ONCE(p);
>> + * } while (a != b);
>> + * asm volatile ("" : : : "memory"); <-- barrier()
>> + * return *b;
>> + * }
>> + *
>> + * With gcc 14.2 (arm64):
>> + *
>> + * fct_2_volatile_barriers:
>> + * adrp x0, .LANCHOR0
>> + * add x0, x0, :lo12:.LANCHOR0
>> + * .L2:
>> + * ldr x1, [x0] <-- x1 populated by first load.
>> + * ldr x2, [x0]
>> + * cmp x1, x2
>> + * bne .L2
>> + * ldr w0, [x1] <-- x1 is used for access which should depend on b.
>> + * ret
>> + *
>> + * On weakly-ordered architectures, this lets CPU speculation use the
>> + * result from the first load to speculate "ldr w0, [x1]" before
>> + * "ldr x2, [x0]".
>> + * Based on the RCU documentation, the control dependency does not
>> + * prevent the CPU from speculating loads.
>
> I'm not sure that example (of something that doesn't work) is really necessary.
> The simple example of, given:
> return a == b ? *a : 0;
> the generated code might speculatively dereference 'b' (not a) before returning
> zero when the pointers are different.
In the past discussion that led to this new API, AFAIU, Linus made it
clear that this counter example needs to be in a comment:
https://lore.kernel.org/lkml/CAHk-=wgBgh5U+dyNaN=+XCdcm2OmgSRbcH4Vbtk8i5ZDGwStSA@mail.gmail.com/
This counter-example is what convinced him that this addresses a real
issue.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency
2025-12-18 13:51 ` Mathieu Desnoyers
@ 2025-12-18 15:54 ` David Laight
0 siblings, 0 replies; 29+ messages in thread
From: David Laight @ 2025-12-18 15:54 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, Joel Fernandes, Paul E. McKenney, linux-kernel,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Will Deacon, Peter Zijlstra,
Alan Stern, John Stultz, Neeraj Upadhyay, Linus Torvalds,
Andrew Morton, Frederic Weisbecker, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm, Gary Guo, Nikita Popov, llvm
On Thu, 18 Dec 2025 08:51:02 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> On 2025-12-18 04:03, David Laight wrote:
> [...]
> >> + *
> >> + * The compiler barrier() is ineffective at fixing this issue. It does
> >> + * not prevent the compiler CSE from losing the address dependency:
> >> + *
> >> + * int fct_2_volatile_barriers(void)
> >> + * {
> >> + * int *a, *b;
> >> + *
> >> + * do {
> >> + * a = READ_ONCE(p);
> >> + * asm volatile ("" : : : "memory");
> >> + * b = READ_ONCE(p);
> >> + * } while (a != b);
> >> + * asm volatile ("" : : : "memory"); <-- barrier()
> >> + * return *b;
> >> + * }
> >> + *
> >> + * With gcc 14.2 (arm64):
> >> + *
> >> + * fct_2_volatile_barriers:
> >> + * adrp x0, .LANCHOR0
> >> + * add x0, x0, :lo12:.LANCHOR0
> >> + * .L2:
> >> + * ldr x1, [x0] <-- x1 populated by first load.
> >> + * ldr x2, [x0]
> >> + * cmp x1, x2
> >> + * bne .L2
> >> + * ldr w0, [x1] <-- x1 is used for access which should depend on b.
> >> + * ret
> >> + *
> >> + * On weakly-ordered architectures, this lets CPU speculation use the
> >> + * result from the first load to speculate "ldr w0, [x1]" before
> >> + * "ldr x2, [x0]".
> >> + * Based on the RCU documentation, the control dependency does not
> >> + * prevent the CPU from speculating loads.
> >
> > I'm not sure that example (of something that doesn't work) is really necessary.
> > The simple example of, given:
> > return a == b ? *a : 0;
> > the generated code might speculatively dereference 'b' (not a) before returning
> > zero when the pointers are different.
>
> In the past discussion that led to this new API, AFAIU, Linus made it
> clear that this counter example needs to be in a comment:
I might remember that...
But if you read the proposed comment it starts looking like an example.
It is also very long for the file it is in - even if clearly marked as why
the same effect can't be achieved with barrier().
Maybe the long gory comment belongs in the rst file?
I do wonder if some places need this:
#define OPTIMISER_HIDE_VAL(x) ({ auto _x = x; OPTIMISER_HIDE_VAR(_x); _x; })
Then you could do:
#define ptr_eq(x, y) (OPTIMISER_HIDE_VAL(x) == OPTIMISER_HIDE_VAL(y))
which includes the check that the pointers are the same type.
But it would be more generally useful for hiding constants from the optimiser.
David
>
> https://lore.kernel.org/lkml/CAHk-=wgBgh5U+dyNaN=+XCdcm2OmgSRbcH4Vbtk8i5ZDGwStSA@mail.gmail.com/
>
> This counter-example is what convinced him that this addresses a real
> issue.
>
> Thanks,
>
> Mathieu
>
>
* Re: [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency
2025-12-18 9:03 ` David Laight
2025-12-18 13:51 ` Mathieu Desnoyers
@ 2025-12-18 14:27 ` Gary Guo
2025-12-18 16:12 ` David Laight
1 sibling, 1 reply; 29+ messages in thread
From: Gary Guo @ 2025-12-18 14:27 UTC (permalink / raw)
To: David Laight
Cc: Mathieu Desnoyers, Boqun Feng, Joel Fernandes, Paul E. McKenney,
linux-kernel, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Peter Zijlstra, Alan Stern, John Stultz, Neeraj Upadhyay,
Linus Torvalds, Andrew Morton, Frederic Weisbecker,
Josh Triplett, Uladzislau Rezki, Steven Rostedt, Lai Jiangshan,
Zqiang, Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm, Nikita Popov, llvm
On Thu, 18 Dec 2025 09:03:13 +0000
David Laight <david.laight.linux@gmail.com> wrote:
> On Wed, 17 Dec 2025 20:45:28 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > index 5b45ea7dff3e..c5ca3b54c112 100644
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -163,6 +163,69 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
> > __asm__ ("" : "=r" (var) : "0" (var))
> > #endif
> >
> > +/*
> > + * Compare two addresses while preserving the address dependencies for
> > + * later use of the address. It should be used when comparing an address
> > + * returned by rcu_dereference().
> > + *
> > + * This is needed to prevent the compiler CSE and SSA GVN optimizations
> > + * from using @a (or @b) in places where the source refers to @b (or @a)
> > + * based on the fact that after the comparison, the two are known to be
> > + * equal, which does not preserve address dependencies and allows the
> > + * following misordering speculations:
> > + *
> > + * - If @b is a constant, the compiler can issue the loads which depend
> > + * on @a before loading @a.
> > + * - If @b is a register populated by a prior load, weakly-ordered
> > + * CPUs can speculate loads which depend on @a before loading @a.
> > + *
> > + * The same logic applies with @a and @b swapped.
> > + *
> > + * Return value: true if pointers are equal, false otherwise.
> > + *
> > + * The compiler barrier() is ineffective at fixing this issue. It does
> > + * not prevent the compiler CSE from losing the address dependency:
> > + *
> > + * int fct_2_volatile_barriers(void)
> > + * {
> > + * int *a, *b;
> > + *
> > + * do {
> > + * a = READ_ONCE(p);
> > + * asm volatile ("" : : : "memory");
> > + * b = READ_ONCE(p);
> > + * } while (a != b);
> > + * asm volatile ("" : : : "memory"); <-- barrier()
> > + * return *b;
> > + * }
> > + *
> > + * With gcc 14.2 (arm64):
> > + *
> > + * fct_2_volatile_barriers:
> > + * adrp x0, .LANCHOR0
> > + * add x0, x0, :lo12:.LANCHOR0
> > + * .L2:
> > + * ldr x1, [x0] <-- x1 populated by first load.
> > + * ldr x2, [x0]
> > + * cmp x1, x2
> > + * bne .L2
> > + * ldr w0, [x1] <-- x1 is used for access which should depend on b.
> > + * ret
> > + *
> > + * On weakly-ordered architectures, this lets CPU speculation use the
> > + * result from the first load to speculate "ldr w0, [x1]" before
> > + * "ldr x2, [x0]".
> > + * Based on the RCU documentation, the control dependency does not
> > + * prevent the CPU from speculating loads.
>
> I'm not sure that example (of something that doesn't work) is really necessary.
> The simple example of, given:
> return a == b ? *a : 0;
> the generated code might speculatively dereference 'b' (not a) before returning
> zero when the pointers are different.
I'm not sure I understand what you're saying.
`b` cannot be speculatively dereferenced by the compiler in the code path
where the pointers are different, as the compiler cannot ascertain that it
is valid.
The speculative execution on the processor side *does not* matter here as
it needs to honour the address dependency (unless you're on Alpha, which is
why we add a `mb()` in each `READ_ONCE`).
Best,
Gary
* Re: [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency
2025-12-18 14:27 ` Gary Guo
@ 2025-12-18 16:12 ` David Laight
0 siblings, 0 replies; 29+ messages in thread
From: David Laight @ 2025-12-18 16:12 UTC (permalink / raw)
To: Gary Guo
Cc: Mathieu Desnoyers, Boqun Feng, Joel Fernandes, Paul E. McKenney,
linux-kernel, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Peter Zijlstra, Alan Stern, John Stultz, Neeraj Upadhyay,
Linus Torvalds, Andrew Morton, Frederic Weisbecker,
Josh Triplett, Uladzislau Rezki, Steven Rostedt, Lai Jiangshan,
Zqiang, Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm, Nikita Popov, llvm
On Thu, 18 Dec 2025 14:27:36 +0000
Gary Guo <gary@garyguo.net> wrote:
> On Thu, 18 Dec 2025 09:03:13 +0000
> David Laight <david.laight.linux@gmail.com> wrote:
>
> > On Wed, 17 Dec 2025 20:45:28 -0500
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > > index 5b45ea7dff3e..c5ca3b54c112 100644
> > > --- a/include/linux/compiler.h
> > > +++ b/include/linux/compiler.h
> > > @@ -163,6 +163,69 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
> > > __asm__ ("" : "=r" (var) : "0" (var))
> > > #endif
> > >
> > > +/*
> > > + * Compare two addresses while preserving the address dependencies for
> > > + * later use of the address. It should be used when comparing an address
> > > + * returned by rcu_dereference().
> > > + *
> > > + * This is needed to prevent the compiler CSE and SSA GVN optimizations
> > > + * from using @a (or @b) in places where the source refers to @b (or @a)
> > > + * based on the fact that after the comparison, the two are known to be
> > > + * equal, which does not preserve address dependencies and allows the
> > > + * following misordering speculations:
> > > + *
> > > + * - If @b is a constant, the compiler can issue the loads which depend
> > > + * on @a before loading @a.
> > > + * - If @b is a register populated by a prior load, weakly-ordered
> > > + * CPUs can speculate loads which depend on @a before loading @a.
> > > + *
> > > + * The same logic applies with @a and @b swapped.
> > > + *
> > > + * Return value: true if pointers are equal, false otherwise.
> > > + *
> > > + * The compiler barrier() is ineffective at fixing this issue. It does
> > > + * not prevent the compiler CSE from losing the address dependency:
> > > + *
> > > + * int fct_2_volatile_barriers(void)
> > > + * {
> > > + * int *a, *b;
> > > + *
> > > + * do {
> > > + * a = READ_ONCE(p);
> > > + * asm volatile ("" : : : "memory");
> > > + * b = READ_ONCE(p);
> > > + * } while (a != b);
> > > + * asm volatile ("" : : : "memory"); <-- barrier()
> > > + * return *b;
> > > + * }
> > > + *
> > > + * With gcc 14.2 (arm64):
> > > + *
> > > + * fct_2_volatile_barriers:
> > > + * adrp x0, .LANCHOR0
> > > + * add x0, x0, :lo12:.LANCHOR0
> > > + * .L2:
> > > + * ldr x1, [x0] <-- x1 populated by first load.
> > > + * ldr x2, [x0]
> > > + * cmp x1, x2
> > > + * bne .L2
> > > + * ldr w0, [x1] <-- x1 is used for access which should depend on b.
> > > + * ret
> > > + *
> > > + * On weakly-ordered architectures, this lets CPU speculation use the
> > > + * result from the first load to speculate "ldr w0, [x1]" before
> > > + * "ldr x2, [x0]".
> > > + * Based on the RCU documentation, the control dependency does not
> > > + * prevent the CPU from speculating loads.
> >
> > I'm not sure that example (of something that doesn't work) is really necessary.
> > The simple example of, given:
> > return a == b ? *a : 0;
> > the generated code might speculatively dereference 'b' (not a) before returning
> > zero when the pointers are different.
>
> I'm not sure I understand what you're saying.
>
> `b` cannot be speculatively dereferenced by the compiler in code-path
> where pointers are different, as the compiler cannot ascertain that it is
> valid.
The 'validity' doesn't matter for speculative execution.
> The speculative execution on the processor side *does not* matter here as
> it needs to honour address dependency (unless you're Alpha, which is why we
> add a `mb()` in each `READ_ONCE`).
There isn't an 'address dependency', that is the problem.
The issue is that 'a == b ? *a : 0' and 'a == b ? *b : 0' always evaluate
to the same value and the compiler will (effectively) substitute one for the
other.
But sometimes you really do care which pointer is speculatively dereferenced
when they are different.
Memory barriers can only enforce the order of the reads of 'a', 'b' and '*a',
they won't change whether the generated code contains '*a' or '*b'.
David
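As an illustration of the fix this series proposes for that case (a sketch
only; the expression is the one quoted above, not code from the thread):

	/*
	 * ptr_eq() hides the equality from the optimizer, so the
	 * compiler cannot substitute *b for *a (or vice versa).
	 */
	return ptr_eq(a, b) ? *a : 0;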
>
> Best,
> Gary
>
>
>
* [RFC PATCH v4 2/4] Documentation: RCU: Refer to ptr_eq()
2025-12-18 1:45 [RFC PATCH v4 0/4] Hazard Pointers Mathieu Desnoyers
2025-12-18 1:45 ` [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency Mathieu Desnoyers
@ 2025-12-18 1:45 ` Mathieu Desnoyers
2025-12-18 1:45 ` [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers Mathieu Desnoyers
` (2 subsequent siblings)
4 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 1:45 UTC (permalink / raw)
To: Boqun Feng, Joel Fernandes, Paul E. McKenney
Cc: linux-kernel, Mathieu Desnoyers, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm, Gary Guo,
Nikita Popov, llvm
Refer to ptr_eq() in the rcu_dereference() documentation.
ptr_eq() is a mechanism that preserves address dependencies when
comparing pointers, and should be favored when comparing a pointer
obtained from rcu_dereference() against another pointer.
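For example, the comparison against a non-NULL constant discussed in the
documentation below would be written as (sketch; gp, default_struct and
do_default() are the names already used in rcu_dereference.rst):

	p = rcu_dereference(gp);
	if (ptr_eq(p, &default_struct))
		do_default(p->a);

so the compiler cannot substitute &default_struct for p in the dereference.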
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
Cc: Nikita Popov <github@npopov.com>
Cc: llvm@lists.linux.dev
---
Changes since v1:
- Include feedback from Paul E. McKenney.
Changes since v0:
- Include feedback from Alan Stern.
---
Documentation/RCU/rcu_dereference.rst | 38 +++++++++++++++++++++++----
1 file changed, 33 insertions(+), 5 deletions(-)
diff --git a/Documentation/RCU/rcu_dereference.rst b/Documentation/RCU/rcu_dereference.rst
index 2524dcdadde2..de6175bf430f 100644
--- a/Documentation/RCU/rcu_dereference.rst
+++ b/Documentation/RCU/rcu_dereference.rst
@@ -104,11 +104,12 @@ readers working properly:
after such branches, but can speculate loads, which can again
result in misordering bugs.
-- Be very careful about comparing pointers obtained from
- rcu_dereference() against non-NULL values. As Linus Torvalds
- explained, if the two pointers are equal, the compiler could
- substitute the pointer you are comparing against for the pointer
- obtained from rcu_dereference(). For example::
+- Use operations that preserve address dependencies (such as
+ "ptr_eq()") to compare pointers obtained from rcu_dereference()
+ against non-NULL pointers. As Linus Torvalds explained, if the
+ two pointers are equal, the compiler could substitute the
+ pointer you are comparing against for the pointer obtained from
+ rcu_dereference(). For example::
p = rcu_dereference(gp);
if (p == &default_struct)
@@ -125,6 +126,29 @@ readers working properly:
On ARM and Power hardware, the load from "default_struct.a"
can now be speculated, such that it might happen before the
rcu_dereference(). This could result in bugs due to misordering.
+ Performing the comparison with "ptr_eq()" ensures the compiler
+ does not perform such transformation.
+
+ If the comparison is against another pointer, the compiler is
+ allowed to use either pointer for the following accesses, which
+ loses the address dependency and allows weakly-ordered
+ architectures such as ARM and PowerPC to speculate the
+ address-dependent load before rcu_dereference(). For example::
+
+ p1 = READ_ONCE(gp);
+ p2 = rcu_dereference(gp);
+ if (p1 == p2) /* BUGGY!!! */
+ do_default(p2->a);
+
+ The compiler can use p1->a rather than p2->a, destroying the
+ address dependency. Performing the comparison with "ptr_eq()"
+ ensures the compiler preserves the address dependencies.
+ Corrected code::
+
+ p1 = READ_ONCE(gp);
+ p2 = rcu_dereference(gp);
+ if (ptr_eq(p1, p2))
+ do_default(p2->a);
However, comparisons are OK in the following cases:
@@ -204,6 +228,10 @@ readers working properly:
comparison will provide exactly the information that the
compiler needs to deduce the value of the pointer.
+ When in doubt, use operations that preserve address dependencies
+ (such as "ptr_eq()") to compare pointers obtained from
+ rcu_dereference() against non-NULL pointers.
+
- Disable any value-speculation optimizations that your compiler
might provide, especially if you are making use of feedback-based
optimizations that take data collected from prior runs. Such
--
2.39.5
* [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-18 1:45 [RFC PATCH v4 0/4] Hazard Pointers Mathieu Desnoyers
2025-12-18 1:45 ` [RFC PATCH v4 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency Mathieu Desnoyers
2025-12-18 1:45 ` [RFC PATCH v4 2/4] Documentation: RCU: Refer to ptr_eq() Mathieu Desnoyers
@ 2025-12-18 1:45 ` Mathieu Desnoyers
2025-12-18 8:36 ` Boqun Feng
2025-12-19 1:22 ` Joel Fernandes
2025-12-18 1:45 ` [RFC PATCH v4 4/4] hazptr: Migrate per-CPU slots to backup slot on context switch Mathieu Desnoyers
2025-12-18 10:33 ` [RFC PATCH v4 0/4] Hazard Pointers Joel Fernandes
4 siblings, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 1:45 UTC (permalink / raw)
To: Boqun Feng, Joel Fernandes, Paul E. McKenney
Cc: linux-kernel, Mathieu Desnoyers, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
This API provides existence guarantees of objects through Hazard
Pointers [1] (hazptr).
Its main benefit over RCU is that it allows fast reclaim of
HP-protected pointers without needing to wait for a grace period.
This implementation has 8 statically allocated hazard pointer slots per
CPU for the fast path, and relies on an on-stack backup slot allocated by
the hazard pointer user as a fallback in case no per-cpu slot is
available.
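A reader/updater usage sketch based on the API introduced below (gp,
struct foo, do_something() and the kfree() reclaim are illustrative
assumptions, not part of this patch):

	struct hazptr_ctx ctx;
	struct foo *old, *p;

	/* Reader: load gp and protect the object it points to. */
	p = hazptr_acquire(&ctx, (void * const *)&gp);
	if (p) {
		do_something(p);	/* p cannot be reclaimed here. */
		hazptr_release(&ctx, p);
	}

	/* Updater: unpublish the object, wait for readers, reclaim. */
	old = xchg(&gp, NULL);
	hazptr_synchronize(old);
	kfree(old);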
References:
[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
lock-free objects," in IEEE Transactions on Parallel and
Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
Link: https://lpc.events/event/19/contributions/2082/
Link: https://lore.kernel.org/lkml/j3scdl5iymjlxavomgc6u5ndg3svhab6ga23dr36o4f5mt333w@7xslvq6b6hmv/
Link: https://lpc.events/event/18/contributions/1731/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
---
Changes since v3:
- Rename hazptr_retire to hazptr_release.
- Remove domains.
- Introduce "backup_slot" within hazptr context structure (on stack)
to handle slot overflow.
- Rename hazptr_try_protect to hazptr_acquire.
- Preallocate 8 per-CPU slots, and rely on caller-provided backup
slots (typically on stack) for out-of-slots situations.
Changes since v2:
- Address Peter Zijlstra's comments.
- Address Paul E. McKenney's comments.
Changes since v0:
- Remove slot variable from hp_dereference_allocate().
---
include/linux/hazptr.h | 182 +++++++++++++++++++++++++++++++++++++++++
init/main.c | 2 +
kernel/Makefile | 2 +-
kernel/hazptr.c | 150 +++++++++++++++++++++++++++++++++
4 files changed, 335 insertions(+), 1 deletion(-)
create mode 100644 include/linux/hazptr.h
create mode 100644 kernel/hazptr.c
diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
new file mode 100644
index 000000000000..70c066ddb0f5
--- /dev/null
+++ b/include/linux/hazptr.h
@@ -0,0 +1,182 @@
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+#ifndef _LINUX_HAZPTR_H
+#define _LINUX_HAZPTR_H
+
+/*
+ * hazptr: Hazard Pointers
+ *
+ * This API provides existence guarantees of objects through hazard
+ * pointers.
+ *
+ * Its main benefit over RCU is that it allows fast reclaim of
+ * HP-protected pointers without needing to wait for a grace period.
+ *
+ * References:
+ *
+ * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
+ * lock-free objects," in IEEE Transactions on Parallel and
+ * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
+ */
+
+#include <linux/percpu.h>
+#include <linux/types.h>
+#include <linux/cleanup.h>
+
+/* 8 slots (each sizeof(void *)) fit in a single cache line. */
+#define NR_HAZPTR_PERCPU_SLOTS 8
+
+/*
+ * Hazard pointer slot.
+ */
+struct hazptr_slot {
+ void *addr;
+};
+
+struct hazptr_backup_slot {
+ struct list_head node;
+ struct hazptr_slot slot;
+ /* CPU requesting the backup slot. */
+ int cpu;
+};
+
+struct hazptr_ctx {
+ struct hazptr_slot *slot;
+ /* Backup slot in case all per-CPU slots are used. */
+ struct hazptr_backup_slot backup_slot;
+};
+
+struct hazptr_percpu_slots {
+ struct hazptr_slot slots[NR_HAZPTR_PERCPU_SLOTS];
+} ____cacheline_aligned;
+
+DECLARE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);
+
+/*
+ * hazptr_synchronize: Wait until @addr is released from all slots.
+ *
+ * Wait to observe that each slot contains a value that differs from
+ * @addr before returning.
+ * Should be called from preemptible context.
+ */
+void hazptr_synchronize(void *addr);
+
+/*
+ * hazptr_chain_backup_slot: Chain backup slot into overflow list.
+ *
+ * Initialize the backup slot in @ctx and chain it into this CPU's
+ * overflow list. Returns the backup slot.
+ */
+struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx);
+
+/*
+ * hazptr_unchain_backup_slot: Unchain backup slot from overflow list.
+ */
+void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx);
+
+static inline
+struct hazptr_slot *hazptr_get_free_percpu_slot(void)
+{
+ struct hazptr_percpu_slots *percpu_slots = this_cpu_ptr(&hazptr_percpu_slots);
+ unsigned int idx;
+
+ for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
+ struct hazptr_slot *slot = &percpu_slots->slots[idx];
+
+ if (!READ_ONCE(slot->addr))
+ return slot;
+ }
+ /* All slots are in use. */
+ return NULL;
+}
+
+static inline
+bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot)
+{
+ return slot == &ctx->backup_slot.slot;
+}
+
+/*
+ * hazptr_acquire: Load pointer at address and protect with hazard pointer.
+ *
+ * Load @addr_p, and protect the loaded pointer with hazard pointer.
+ *
+ * Returns a non-NULL protected address if the loaded pointer is non-NULL.
+ * Returns NULL if the loaded pointer is NULL.
+ *
+ * On success the protected hazptr slot is stored in @ctx->slot.
+ */
+static inline
+void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
+{
+ struct hazptr_slot *slot = NULL;
+ void *addr, *addr2;
+
+ /*
+ * Load @addr_p to know which address should be protected.
+ */
+ addr = READ_ONCE(*addr_p);
+ for (;;) {
+ if (!addr)
+ return NULL;
+ guard(preempt)();
+ if (likely(!hazptr_slot_is_backup(ctx, slot))) {
+ slot = hazptr_get_free_percpu_slot();
+ /*
+ * If all the per-CPU slots are already in use, fallback
+ * to the backup slot.
+ */
+ if (unlikely(!slot))
+ slot = hazptr_chain_backup_slot(ctx);
+ }
+ WRITE_ONCE(slot->addr, addr); /* Store B */
+
+ /* Memory ordering: Store B before Load A. */
+ smp_mb();
+
+ /*
+ * Re-load @addr_p after storing it to the hazard pointer slot.
+ */
+ addr2 = READ_ONCE(*addr_p); /* Load A */
+ if (likely(ptr_eq(addr2, addr)))
+ break;
+ /*
+ * If @addr_p content has changed since the first load,
+ * release the hazard pointer and try again.
+ */
+ WRITE_ONCE(slot->addr, NULL);
+ if (!addr2) {
+ if (hazptr_slot_is_backup(ctx, slot))
+ hazptr_unchain_backup_slot(ctx);
+ return NULL;
+ }
+ addr = addr2;
+ }
+ ctx->slot = slot;
+ /*
+ * Use addr2 loaded from the second READ_ONCE() to preserve
+ * address dependency ordering.
+ */
+ return addr2;
+}
+
+/* Release the protected hazard pointer from @slot. */
+static inline
+void hazptr_release(struct hazptr_ctx *ctx, void *addr)
+{
+ struct hazptr_slot *slot;
+
+ if (!addr)
+ return;
+ slot = ctx->slot;
+ WARN_ON_ONCE(slot->addr != addr);
+ smp_store_release(&slot->addr, NULL);
+ if (unlikely(hazptr_slot_is_backup(ctx, slot)))
+ hazptr_unchain_backup_slot(ctx);
+}
+
+void hazptr_init(void);
+
+#endif /* _LINUX_HAZPTR_H */
diff --git a/init/main.c b/init/main.c
index 07a3116811c5..858eaa87bde7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -104,6 +104,7 @@
#include <linux/pidfs.h>
#include <linux/ptdump.h>
#include <linux/time_namespace.h>
+#include <linux/hazptr.h>
#include <net/net_namespace.h>
#include <asm/io.h>
@@ -1002,6 +1003,7 @@ void start_kernel(void)
workqueue_init_early();
rcu_init();
+ hazptr_init();
kvfree_rcu_init();
/* Trace events are available after this */
diff --git a/kernel/Makefile b/kernel/Makefile
index 9fe722305c9b..1178907fe0ea 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
cpu.o exit.o softirq.o resource.o \
sysctl.o capability.o ptrace.o user.o \
signal.o sys.o umh.o workqueue.o pid.o task_work.o \
- extable.o params.o \
+ extable.o params.o hazptr.o \
kthread.o sys_ni.o nsproxy.o nstree.o nscommon.o \
notifier.o ksysfs.o cred.o reboot.o \
async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
diff --git a/kernel/hazptr.c b/kernel/hazptr.c
new file mode 100644
index 000000000000..2ec288bc1132
--- /dev/null
+++ b/kernel/hazptr.c
@@ -0,0 +1,150 @@
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+/*
+ * hazptr: Hazard Pointers
+ */
+
+#include <linux/hazptr.h>
+#include <linux/percpu.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/export.h>
+
+struct overflow_list {
+ raw_spinlock_t lock; /* Lock protecting overflow list and list generation. */
+ struct list_head head; /* Overflow list head. */
+ uint64_t gen; /* Overflow list generation. */
+};
+
+static DEFINE_PER_CPU(struct overflow_list, percpu_overflow_list);
+
+DEFINE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);
+EXPORT_PER_CPU_SYMBOL_GPL(hazptr_percpu_slots);
+
+/*
+ * Perform piecewise iteration on overflow list waiting until "addr" is
+ * not present. Raw spinlock is released and taken between each list
+ * item and busy loop iteration. The overflow list generation is checked
+ * each time the lock is taken to validate that the list has not changed
+ * before resuming iteration or busy wait. If the generation has
+ * changed, retry the entire list traversal.
+ */
+static
+void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
+{
+ struct hazptr_backup_slot *backup_slot;
+ uint64_t snapshot_gen;
+
+ raw_spin_lock(&overflow_list->lock);
+retry:
+ snapshot_gen = overflow_list->gen;
+ list_for_each_entry(backup_slot, &overflow_list->head, node) {
+ /* Busy-wait if node is found. */
+ while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
+ raw_spin_unlock(&overflow_list->lock);
+ cpu_relax();
+ raw_spin_lock(&overflow_list->lock);
+ if (overflow_list->gen != snapshot_gen)
+ goto retry;
+ }
+ raw_spin_unlock(&overflow_list->lock);
+ /*
+ * Release raw spinlock, validate generation after
+ * re-acquiring the lock.
+ */
+ raw_spin_lock(&overflow_list->lock);
+ if (overflow_list->gen != snapshot_gen)
+ goto retry;
+ }
+ raw_spin_unlock(&overflow_list->lock);
+}
+
+static
+void hazptr_synchronize_cpu_slots(int cpu, void *addr)
+{
+ struct hazptr_percpu_slots *percpu_slots = per_cpu_ptr(&hazptr_percpu_slots, cpu);
+ unsigned int idx;
+
+ for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
+ struct hazptr_slot *slot = &percpu_slots->slots[idx];
+
+ /* Busy-wait if node is found. */
+ smp_cond_load_acquire(&slot->addr, VAL != addr); /* Load B */
+ }
+}
+
+/*
+ * hazptr_synchronize: Wait until @addr is released from all slots.
+ *
+ * Wait to observe that each slot contains a value that differs from
+ * @addr before returning.
+ * Should be called from preemptible context.
+ */
+void hazptr_synchronize(void *addr)
+{
+ int cpu;
+
+ /*
+ * Busy-wait should only be done from preemptible context.
+ */
+ lockdep_assert_preemption_enabled();
+
+ /*
+ * Store A precedes hazptr_synchronize(): it unpublishes addr (sets it to
+ * NULL or to a different value), and thus hides it from hazard
+ * pointer readers.
+ */
+ if (!addr)
+ return;
+ /* Memory ordering: Store A before Load B. */
+ smp_mb();
+ /* Scan all CPUs slots. */
+ for_each_possible_cpu(cpu) {
+ /* Scan CPU slots. */
+ hazptr_synchronize_cpu_slots(cpu, addr);
+ /* Scan backup slots in percpu overflow list. */
+ hazptr_synchronize_overflow_list(per_cpu_ptr(&percpu_overflow_list, cpu), addr);
+ }
+}
+EXPORT_SYMBOL_GPL(hazptr_synchronize);
+
+struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx)
+{
+ struct overflow_list *overflow_list = this_cpu_ptr(&percpu_overflow_list);
+ struct hazptr_slot *slot = &ctx->backup_slot.slot;
+
+ slot->addr = NULL;
+
+ raw_spin_lock(&overflow_list->lock);
+ overflow_list->gen++;
+ list_add(&ctx->backup_slot.node, &overflow_list->head);
+ ctx->backup_slot.cpu = smp_processor_id();
+ raw_spin_unlock(&overflow_list->lock);
+ return slot;
+}
+EXPORT_SYMBOL_GPL(hazptr_chain_backup_slot);
+
+void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx)
+{
+ struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, ctx->backup_slot.cpu);
+
+ raw_spin_lock(&overflow_list->lock);
+ overflow_list->gen++;
+ list_del(&ctx->backup_slot.node);
+ raw_spin_unlock(&overflow_list->lock);
+}
+EXPORT_SYMBOL_GPL(hazptr_unchain_backup_slot);
+
+void __init hazptr_init(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, cpu);
+
+ raw_spin_lock_init(&overflow_list->lock);
+ INIT_LIST_HEAD(&overflow_list->head);
+ }
+}
--
2.39.5
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-18 1:45 ` [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers Mathieu Desnoyers
@ 2025-12-18 8:36 ` Boqun Feng
2025-12-18 17:35 ` Mathieu Desnoyers
2025-12-19 1:22 ` Joel Fernandes
1 sibling, 1 reply; 29+ messages in thread
From: Boqun Feng @ 2025-12-18 8:36 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On Wed, Dec 17, 2025 at 08:45:30PM -0500, Mathieu Desnoyers wrote:
[...]
> +static inline
> +struct hazptr_slot *hazptr_get_free_percpu_slot(void)
> +{
> + struct hazptr_percpu_slots *percpu_slots = this_cpu_ptr(&hazptr_percpu_slots);
> + unsigned int idx;
> +
> + for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
> + struct hazptr_slot *slot = &percpu_slots->slots[idx];
> +
> + if (!READ_ONCE(slot->addr))
> + return slot;
> + }
> + /* All slots are in use. */
> + return NULL;
> +}
> +
> +static inline
> +bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot)
> +{
> + return slot == &ctx->backup_slot.slot;
> +}
> +
> +/*
> + * hazptr_acquire: Load pointer at address and protect with hazard pointer.
> + *
> + * Load @addr_p, and protect the loaded pointer with hazard pointer.
> + *
> + * Returns a non-NULL protected address if the loaded pointer is non-NULL.
> + * Returns NULL if the loaded pointer is NULL.
> + *
> + * On success the protected hazptr slot is stored in @ctx->slot.
> + */
> +static inline
> +void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
> +{
> + struct hazptr_slot *slot = NULL;
> + void *addr, *addr2;
> +
> + /*
> + * Load @addr_p to know which address should be protected.
> + */
> + addr = READ_ONCE(*addr_p);
> + for (;;) {
> + if (!addr)
> + return NULL;
> + guard(preempt)();
> + if (likely(!hazptr_slot_is_backup(ctx, slot))) {
> + slot = hazptr_get_free_percpu_slot();
I need to continue to share my concerns about this "allocating slot while
protecting" pattern. Realistically, we will go over a few of the
per-CPU hazard pointer slots *every time* instead of directly using a
pre-allocated hazard pointer slot. Could you utilize this[1] to see a
comparison of the reader-side performance against RCU/SRCU?
> + /*
> + * If all the per-CPU slots are already in use, fallback
> + * to the backup slot.
> + */
> + if (unlikely(!slot))
> + slot = hazptr_chain_backup_slot(ctx);
> + }
> + WRITE_ONCE(slot->addr, addr); /* Store B */
> +
> + /* Memory ordering: Store B before Load A. */
> + smp_mb();
> +
> + /*
> + * Re-load @addr_p after storing it to the hazard pointer slot.
> + */
> + addr2 = READ_ONCE(*addr_p); /* Load A */
> + if (likely(ptr_eq(addr2, addr)))
> + break;
> + /*
> + * If @addr_p content has changed since the first load,
> + * release the hazard pointer and try again.
> + */
> + WRITE_ONCE(slot->addr, NULL);
> + if (!addr2) {
> + if (hazptr_slot_is_backup(ctx, slot))
> + hazptr_unchain_backup_slot(ctx);
> + return NULL;
> + }
> + addr = addr2;
> + }
> + ctx->slot = slot;
> + /*
> + * Use addr2 loaded from the second READ_ONCE() to preserve
> + * address dependency ordering.
> + */
> + return addr2;
> +}
> +
> +/* Release the protected hazard pointer from @slot. */
> +static inline
> +void hazptr_release(struct hazptr_ctx *ctx, void *addr)
> +{
> + struct hazptr_slot *slot;
> +
> + if (!addr)
> + return;
> + slot = ctx->slot;
> + WARN_ON_ONCE(slot->addr != addr);
> + smp_store_release(&slot->addr, NULL);
> + if (unlikely(hazptr_slot_is_backup(ctx, slot)))
> + hazptr_unchain_backup_slot(ctx);
> +}
> +
> +void hazptr_init(void);
> +
> +#endif /* _LINUX_HAZPTR_H */
> diff --git a/init/main.c b/init/main.c
> index 07a3116811c5..858eaa87bde7 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -104,6 +104,7 @@
> #include <linux/pidfs.h>
> #include <linux/ptdump.h>
> #include <linux/time_namespace.h>
> +#include <linux/hazptr.h>
> #include <net/net_namespace.h>
>
> #include <asm/io.h>
> @@ -1002,6 +1003,7 @@ void start_kernel(void)
> workqueue_init_early();
>
> rcu_init();
> + hazptr_init();
> kvfree_rcu_init();
>
> /* Trace events are available after this */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 9fe722305c9b..1178907fe0ea 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
> cpu.o exit.o softirq.o resource.o \
> sysctl.o capability.o ptrace.o user.o \
> signal.o sys.o umh.o workqueue.o pid.o task_work.o \
> - extable.o params.o \
> + extable.o params.o hazptr.o \
> kthread.o sys_ni.o nsproxy.o nstree.o nscommon.o \
> notifier.o ksysfs.o cred.o reboot.o \
> async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
> diff --git a/kernel/hazptr.c b/kernel/hazptr.c
> new file mode 100644
> index 000000000000..2ec288bc1132
> --- /dev/null
> +++ b/kernel/hazptr.c
> @@ -0,0 +1,150 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +/*
> + * hazptr: Hazard Pointers
> + */
> +
> +#include <linux/hazptr.h>
> +#include <linux/percpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/list.h>
> +#include <linux/export.h>
> +
> +struct overflow_list {
> + raw_spinlock_t lock; /* Lock protecting overflow list and list generation. */
> + struct list_head head; /* Overflow list head. */
> + uint64_t gen; /* Overflow list generation. */
> +};
> +
> +static DEFINE_PER_CPU(struct overflow_list, percpu_overflow_list);
> +
> +DEFINE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots);
> +EXPORT_PER_CPU_SYMBOL_GPL(hazptr_percpu_slots);
> +
> +/*
> + * Perform piecewise iteration on overflow list waiting until "addr" is
> + * not present. Raw spinlock is released and taken between each list
> + * item and busy loop iteration. The overflow list generation is checked
> + * each time the lock is taken to validate that the list has not changed
> + * before resuming iteration or busy wait. If the generation has
> + * changed, retry the entire list traversal.
> + */
> +static
> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
> +{
> + struct hazptr_backup_slot *backup_slot;
> + uint64_t snapshot_gen;
> +
> + raw_spin_lock(&overflow_list->lock);
> +retry:
> + snapshot_gen = overflow_list->gen;
> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
> + /* Busy-wait if node is found. */
> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> + raw_spin_unlock(&overflow_list->lock);
> + cpu_relax();
I think we should prioritize the scan thread solution [2] instead of
busy-waiting in hazard pointer updaters, because when we have multiple
hazard pointer usages we would want to consolidate the scans from the
updater side. If so, the whole ->gen can be avoided.
However, this ->gen idea does seem to resolve another issue for me. I'm
trying to make shazptr critical sections preemptible by using a per-task
backup slot (if you recall, this is your idea from the hallway
discussions we had during LPC 2024), and currently I cannot make it
work because of the following sequence:
1. CPU 0 already has one pointer protected.
2. CPU 1 begins the updater scan; it scans the list of preempted
hazard pointer readers and finds no reader.
3. CPU 0 does a context switch; it stores the current hazard pointer
value to the current task's ->hazard_slot (let's say the task is task
A), and adds it to the list of preempted hazard pointer readers.
4. CPU 0 clears its percpu hazptr_slots for the next task (B).
5. CPU 1 continues the updater scan, and it scans the percpu slot of
CPU 0, and finds no reader.
In this situation, the updater will miss a reader. But if we add
generation snapshotting at step 2 and a generation increment at step 3, I
think it'll work.
IMO, if we make this work, it's better than the current backup slot
mechanism, because we only need to acquire the lock if a context switch
happens. I will look into the implementation of this and, if I can get it
done, I will send it in my next version of shazptr. I mention it here just
to add this option to the discussion.
[1]: https://lore.kernel.org/lkml/20250625031101.12555-3-boqun.feng@gmail.com/
[2]: https://lore.kernel.org/lkml/20250625031101.12555-5-boqun.feng@gmail.com/
Regards,
Boqun
> + raw_spin_lock(&overflow_list->lock);
> + if (overflow_list->gen != snapshot_gen)
> + goto retry;
> + }
> + raw_spin_unlock(&overflow_list->lock);
> + /*
> + * Release raw spinlock, validate generation after
> + * re-acquiring the lock.
> + */
> + raw_spin_lock(&overflow_list->lock);
> + if (overflow_list->gen != snapshot_gen)
> + goto retry;
> + }
> + raw_spin_unlock(&overflow_list->lock);
> +}
> +
> +static
> +void hazptr_synchronize_cpu_slots(int cpu, void *addr)
> +{
> + struct hazptr_percpu_slots *percpu_slots = per_cpu_ptr(&hazptr_percpu_slots, cpu);
> + unsigned int idx;
> +
> + for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
> + struct hazptr_slot *slot = &percpu_slots->slots[idx];
> +
> + /* Busy-wait if node is found. */
> + smp_cond_load_acquire(&slot->addr, VAL != addr); /* Load B */
> + }
> +}
> +
> +/*
> + * hazptr_synchronize: Wait until @addr is released from all slots.
> + *
> + * Wait to observe that each slot contains a value that differs from
> + * @addr before returning.
> + * Should be called from preemptible context.
> + */
> +void hazptr_synchronize(void *addr)
> +{
> + int cpu;
> +
> + /*
> + * Busy-wait should only be done from preemptible context.
> + */
> + lockdep_assert_preemption_enabled();
> +
> + /*
> + * Store A precedes hazptr_scan(): it unpublishes addr (sets it to
> + * NULL or to a different value), and thus hides it from hazard
> + * pointer readers.
> + */
> + if (!addr)
> + return;
> + /* Memory ordering: Store A before Load B. */
> + smp_mb();
> + /* Scan all CPUs slots. */
> + for_each_possible_cpu(cpu) {
> + /* Scan CPU slots. */
> + hazptr_synchronize_cpu_slots(cpu, addr);
> + /* Scan backup slots in percpu overflow list. */
> + hazptr_synchronize_overflow_list(per_cpu_ptr(&percpu_overflow_list, cpu), addr);
> + }
> +}
> +EXPORT_SYMBOL_GPL(hazptr_synchronize);
> +
> +struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx)
> +{
> + struct overflow_list *overflow_list = this_cpu_ptr(&percpu_overflow_list);
> + struct hazptr_slot *slot = &ctx->backup_slot.slot;
> +
> + slot->addr = NULL;
> +
> + raw_spin_lock(&overflow_list->lock);
> + overflow_list->gen++;
> + list_add(&ctx->backup_slot.node, &overflow_list->head);
> + ctx->backup_slot.cpu = smp_processor_id();
> + raw_spin_unlock(&overflow_list->lock);
> + return slot;
> +}
> +EXPORT_SYMBOL_GPL(hazptr_chain_backup_slot);
> +
> +void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx)
> +{
> + struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, ctx->backup_slot.cpu);
> +
> + raw_spin_lock(&overflow_list->lock);
> + overflow_list->gen++;
> + list_del(&ctx->backup_slot.node);
> + raw_spin_unlock(&overflow_list->lock);
> +}
> +EXPORT_SYMBOL_GPL(hazptr_unchain_backup_slot);
> +
> +void __init hazptr_init(void)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, cpu);
> +
> + raw_spin_lock_init(&overflow_list->lock);
> + INIT_LIST_HEAD(&overflow_list->head);
> + }
> +}
> --
> 2.39.5
>
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-18 8:36 ` Boqun Feng
@ 2025-12-18 17:35 ` Mathieu Desnoyers
2025-12-18 20:22 ` Boqun Feng
2025-12-19 0:43 ` Boqun Feng
0 siblings, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 17:35 UTC (permalink / raw)
To: Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 2025-12-18 03:36, Boqun Feng wrote:
> On Wed, Dec 17, 2025 at 08:45:30PM -0500, Mathieu Desnoyers wrote:
[...]
>> +static inline
>> +void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
>> +{
>> + struct hazptr_slot *slot = NULL;
>> + void *addr, *addr2;
>> +
>> + /*
>> + * Load @addr_p to know which address should be protected.
>> + */
>> + addr = READ_ONCE(*addr_p);
>> + for (;;) {
>> + if (!addr)
>> + return NULL;
>> + guard(preempt)();
>> + if (likely(!hazptr_slot_is_backup(ctx, slot))) {
>> + slot = hazptr_get_free_percpu_slot();
>
> I need to continue share my concerns about this "allocating slot while
> protecting" pattern. Here realistically, we will go over a few of the
> per-CPU hazard pointer slots *every time* instead of directly using a
> pre-allocated hazard pointer slot.
No, that's not the expected fast-path with CONFIG_PREEMPT_HAZPTR=y
(introduced in patch 4/4).
With PREEMPT_HAZPTR, using more than one hazard pointer per CPU will
only happen if there are nested hazard pointer users, which can happen
due to:
- Holding a hazard pointer across function calls, where another hazard
pointer is used.
- Using hazard pointers from interrupt handlers (note: my current code
only does preempt disable, not irq disable; this is something I'd need
to change if we wish to acquire/release hazard pointers from interrupt
handlers). But even that should be a rare event.
So the fast-path has an initial state where there are no hazard pointers
in use on the CPU, which means hazptr_acquire() finds its empty slot at
index 0.
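For illustration, the slot lookup on that fast path boils down to
something like the following (a rough sketch, not the actual patch code;
the caller has preemption disabled and then publishes the protected
address into the returned slot):

    static inline struct hazptr_slot *hazptr_get_free_percpu_slot(void)
    {
            struct hazptr_percpu_slots *slots = this_cpu_ptr(&hazptr_percpu_slots);
            unsigned int idx;

            for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) {
                    struct hazptr_slot *slot = &slots->slots[idx];

                    /* With no hazptr in use on this CPU, idx == 0 is free. */
                    if (!READ_ONCE(slot->addr))
                            return slot;
            }
            return NULL;    /* All 8 slots busy: fall back to the backup slot. */
    }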
> Could you utilize this[1] to see a
> comparison of the reader-side performance against RCU/SRCU?
Good point ! Let's see.
On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
hyperthreading disabled,
CONFIG_PREEMPT=y,
CONFIG_PREEMPT_RCU=y,
CONFIG_PREEMPT_HAZPTR=y.
scale_type ns
-----------------------
hazptr-smp-mb 13.1 <- this implementation
hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
rcu 17.0
srcu 20.0
srcu-fast 1.5
rcu-tasks 0.0
rcu-trace 1.7
refcnt 1148.0
rwlock 1190.0
rwsem 4199.3
lock 41070.6
lock-irq 46176.3
acqrel 1.1
So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
appear to beat hazptr read-side performance.
[...]
>> +/*
>> + * Perform piecewise iteration on overflow list waiting until "addr" is
>> + * not present. Raw spinlock is released and taken between each list
>> + * item and busy loop iteration. The overflow list generation is checked
>> + * each time the lock is taken to validate that the list has not changed
>> + * before resuming iteration or busy wait. If the generation has
>> + * changed, retry the entire list traversal.
>> + */
>> +static
>> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
>> +{
>> + struct hazptr_backup_slot *backup_slot;
>> + uint64_t snapshot_gen;
>> +
>> + raw_spin_lock(&overflow_list->lock);
>> +retry:
>> + snapshot_gen = overflow_list->gen;
>> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
>> + /* Busy-wait if node is found. */
>> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>> + raw_spin_unlock(&overflow_list->lock);
>> + cpu_relax();
>
> I think we should prioritize the scan thread solution [2] instead of
> busy waiting hazrd pointer updaters, because when we have multiple
> hazard pointer usages we would want to consolidate the scans from
> updater side.
I agree that batching scans with a worker thread is a logical next step.
> If so, the whole ->gen can be avoided.
How would it allow removing the generation trick without causing long
raw spinlock latencies ?
>
> However this ->gen idea does seem ot resolve another issue for me, I'm
> trying to make shazptr critical section preemptive by using a per-task
> backup slot (if you recall, this is your idea from the hallway
> discussions we had during LPC 2024),
I honestly did not remember. It's been a whole year! ;-)
> and currently I could not make it
> work because the following sequeue:
>
> 1. CPU 0 already has one pointer protected.
>
> 2. CPU 1 begins the updater scan, and it scans the list of preempted
> hazard pointer readers, no reader.
>
> 3. CPU 0 does a context switch, it stores the current hazard pointer
> value to the current task's ->hazard_slot (let's say the task is task
> A), and add it to the list of preempted hazard pointer readers.
>
> 4. CPU 0 clears its percpu hazptr_slots for the next task (B).
>
> 5. CPU 1 continues the updater scan, and it scans the percpu slot of
> CPU 0, and finds no reader.
>
> in this situation, updater will miss a reader. But if we add a
> generation snapshotting at step 2 and generation increment at step 3, I
> think it'll work.
>
> IMO, if we make this work, it's better than the current backup slot
> mechanism IMO, because we only need to acquire the lock if context
> switch happens.
With PREEMPT_HAZPTR we also only need to acquire the per-cpu overflow
list raw spinlock on context switch (preemption or blocking). The only
other case requiring it is nested hazptr usage (more than 8 active
hazptrs) in a thread context + nested irqs.
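For reference, the context-switch side of PREEMPT_HAZPTR amounts to
something like the sketch below (purely illustrative, not the patch 4/4
code; the per-task list and the hazptr_ctxs/task_node/slot field names
are invented, and the memory-ordering details are simplified):

    /*
     * On sched-out, migrate every per-CPU slot held by the outgoing
     * task into the backup slot embedded in its hazptr_ctx, so the
     * per-CPU slots never stay occupied across preemption/blocking.
     */
    static void hazptr_sched_out(struct task_struct *prev)
    {
            struct hazptr_ctx *ctx;

            list_for_each_entry(ctx, &prev->hazptr_ctxs, task_node) {
                    struct hazptr_slot *percpu_slot = ctx->slot;
                    struct hazptr_slot *backup;

                    if (hazptr_slot_is_backup(ctx, percpu_slot))
                            continue;       /* Already on the overflow list. */
                    /* Publish the backup slot (takes the overflow list lock)... */
                    backup = hazptr_chain_backup_slot(ctx);
                    smp_store_release(&backup->addr, percpu_slot->addr);
                    ctx->slot = backup;
                    /* ...then release the per-CPU slot for the next task. */
                    smp_store_release(&percpu_slot->addr, NULL);
            }
    }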
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-18 17:35 ` Mathieu Desnoyers
@ 2025-12-18 20:22 ` Boqun Feng
2025-12-18 23:36 ` Mathieu Desnoyers
2025-12-19 0:43 ` Boqun Feng
1 sibling, 1 reply; 29+ messages in thread
From: Boqun Feng @ 2025-12-18 20:22 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
> On 2025-12-18 03:36, Boqun Feng wrote:
> > On Wed, Dec 17, 2025 at 08:45:30PM -0500, Mathieu Desnoyers wrote:
> [...]
> > > +static inline
> > > +void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
> > > +{
> > > + struct hazptr_slot *slot = NULL;
> > > + void *addr, *addr2;
> > > +
> > > + /*
> > > + * Load @addr_p to know which address should be protected.
> > > + */
> > > + addr = READ_ONCE(*addr_p);
> > > + for (;;) {
> > > + if (!addr)
> > > + return NULL;
> > > + guard(preempt)();
> > > + if (likely(!hazptr_slot_is_backup(ctx, slot))) {
> > > + slot = hazptr_get_free_percpu_slot();
> >
> > I need to continue share my concerns about this "allocating slot while
> > protecting" pattern. Here realistically, we will go over a few of the
> > per-CPU hazard pointer slots *every time* instead of directly using a
> > pre-allocated hazard pointer slot.
>
> No, that's not the expected fast-path with CONFIG_PREEMPT_HAZPTR=y
> (introduced in patch 4/4).
>
I see, I was missing patch #4; I will take a look and reply
accordingly.
> With PREEMPT_HAZPTR, using more than one hazard pointer per CPU will
> only happen if there are nested hazard pointer users, which can happen
> due to:
>
> - Holding a hazard pointer across function calls, where another hazard
> pointer is used.
> - Using hazard pointers from interrupt handlers (note: my current code
> only does preempt disable, not irq disable, this is something I'd need
> to change if we wish to acquire/release hazard pointers from interrupt
> handlers). But even that should be a rare event.
>
> So the fast-path has an initial state where there are no hazard pointers
> in use on the CPU, which means hazptr_acquire() finds its empty slot at
> index 0.
>
> > Could you utilize this[1] to see a
> > comparison of the reader-side performance against RCU/SRCU?
>
> Good point ! Let's see.
>
> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> hyperthreading disabled,
> CONFIG_PREEMPT=y,
> CONFIG_PREEMPT_RCU=y,
> CONFIG_PREEMPT_HAZPTR=y.
>
> scale_type ns
> -----------------------
> hazptr-smp-mb 13.1 <- this implementation
> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
> rcu 17.0
> srcu 20.0
> srcu-fast 1.5
> rcu-tasks 0.0
> rcu-trace 1.7
> refcnt 1148.0
> rwlock 1190.0
> rwsem 4199.3
> lock 41070.6
> lock-irq 46176.3
> acqrel 1.1
>
> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
> appear to beat hazptr read-side performance.
>
Could you also see the reader-side performance impact when the percpu
hazard pointer slots are used up? I.e. the worst case.
> [...]
>
> > > +/*
> > > + * Perform piecewise iteration on overflow list waiting until "addr" is
> > > + * not present. Raw spinlock is released and taken between each list
> > > + * item and busy loop iteration. The overflow list generation is checked
> > > + * each time the lock is taken to validate that the list has not changed
> > > + * before resuming iteration or busy wait. If the generation has
> > > + * changed, retry the entire list traversal.
> > > + */
> > > +static
> > > +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
> > > +{
> > > + struct hazptr_backup_slot *backup_slot;
> > > + uint64_t snapshot_gen;
> > > +
> > > + raw_spin_lock(&overflow_list->lock);
> > > +retry:
> > > + snapshot_gen = overflow_list->gen;
> > > + list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > > + /* Busy-wait if node is found. */
> > > + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > > + raw_spin_unlock(&overflow_list->lock);
> > > + cpu_relax();
> >
> > I think we should prioritize the scan thread solution [2] instead of
> > busy waiting hazrd pointer updaters, because when we have multiple
> > hazard pointer usages we would want to consolidate the scans from
> > updater side.
>
> I agree that batching scans with a worker thread is a logical next step.
>
> > If so, the whole ->gen can be avoided.
>
> How would it allow removing the generation trick without causing long
> raw spinlock latencies ?
>
Because we won't need to busy-wait for the readers to go away, we can
check whether they are still there in the next scan.
so:
list_for_each_entry(backup_slot, &overflow_list->head, node) {
/* Node found: do not busy-wait, leave it for the next scan. */
if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
<mark addr as unable to free and move on>
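Fleshed out a little, such a single non-waiting pass could look like
this (illustrative only; the return-value bookkeeping is made up, and
the lock is still held across the whole walk here):

    static bool hazptr_scan_overflow_list(struct overflow_list *overflow_list,
                                          void *addr)
    {
            struct hazptr_backup_slot *backup_slot;
            bool busy = false;

            raw_spin_lock(&overflow_list->lock);
            list_for_each_entry(backup_slot, &overflow_list->head, node) {
                    if (smp_load_acquire(&backup_slot->slot.addr) == addr) {
                            /* Still protected: keep addr on the retire list. */
                            busy = true;
                            break;
                    }
            }
            raw_spin_unlock(&overflow_list->lock);
            return busy;
    }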
> >
> > However this ->gen idea does seem ot resolve another issue for me, I'm
> > trying to make shazptr critical section preemptive by using a per-task
> > backup slot (if you recall, this is your idea from the hallway
> > discussions we had during LPC 2024),
>
> I honestly did not remember. It's been a whole year! ;-)
>
> > and currently I could not make it
> > work because the following sequeue:
> >
> > 1. CPU 0 already has one pointer protected.
> >
> > 2. CPU 1 begins the updater scan, and it scans the list of preempted
> > hazard pointer readers, no reader.
> >
> > 3. CPU 0 does a context switch, it stores the current hazard pointer
> > value to the current task's ->hazard_slot (let's say the task is task
> > A), and add it to the list of preempted hazard pointer readers.
> >
> > 4. CPU 0 clears its percpu hazptr_slots for the next task (B).
> >
> > 5. CPU 1 continues the updater scan, and it scans the percpu slot of
> > CPU 0, and finds no reader.
> >
> > in this situation, updater will miss a reader. But if we add a
> > generation snapshotting at step 2 and generation increment at step 3, I
> > think it'll work.
> >
> > IMO, if we make this work, it's better than the current backup slot
> > mechanism IMO, because we only need to acquire the lock if context
> > switch happens.
>
> With PREEMPT_HAZPTR we also only need to acquire the per-cpu overflow
> list raw spinlock on context switch (preemption or blocking). The only
Indeed, pre-allocating the slot on the stack to save the percpu slot
on context switch seems easier and quite smart ;-) Let me take a look.
Regards,
Boqun
> other case requiring it is hazptr nested usage (more than 8 active
> hazptr) on a thread context + nested irqs.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-18 20:22 ` Boqun Feng
@ 2025-12-18 23:36 ` Mathieu Desnoyers
2025-12-19 0:25 ` Boqun Feng
0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 23:36 UTC (permalink / raw)
To: Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 2025-12-18 15:22, Boqun Feng wrote:
[...]
>>> Could you utilize this[1] to see a
>>> comparison of the reader-side performance against RCU/SRCU?
>>
>> Good point ! Let's see.
>>
>> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>> hyperthreading disabled,
>> CONFIG_PREEMPT=y,
>> CONFIG_PREEMPT_RCU=y,
>> CONFIG_PREEMPT_HAZPTR=y.
>>
>> scale_type ns
>> -----------------------
>> hazptr-smp-mb 13.1 <- this implementation
>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
>> rcu 17.0
>> srcu 20.0
>> srcu-fast 1.5
>> rcu-tasks 0.0
>> rcu-trace 1.7
>> refcnt 1148.0
>> rwlock 1190.0
>> rwsem 4199.3
>> lock 41070.6
>> lock-irq 46176.3
>> acqrel 1.1
>>
>> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
>> appear to beat hazptr read-side performance.
>>
>
> Could you also see the reader-side performance impact when the percpu
> hazard pointer slots are used up? I.e. the worst case.
I've modified the code to populate "(void *)1UL" in the first 7 slots
at bootup; here is the result:
hazptr-smp-mb-7-fail 16.3 ns
So we go from 13.1 ns to 16.3 ns when all but one slot are used.
And if we pre-populate the 8 slots for each cpu, and thus force
fallback to overflow list:
hazptr-smp-mb-8-fail 67.1 ns
>
>> [...]
>>
>>>> +/*
>>>> + * Perform piecewise iteration on overflow list waiting until "addr" is
>>>> + * not present. Raw spinlock is released and taken between each list
>>>> + * item and busy loop iteration. The overflow list generation is checked
>>>> + * each time the lock is taken to validate that the list has not changed
>>>> + * before resuming iteration or busy wait. If the generation has
>>>> + * changed, retry the entire list traversal.
>>>> + */
>>>> +static
>>>> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
>>>> +{
>>>> + struct hazptr_backup_slot *backup_slot;
>>>> + uint64_t snapshot_gen;
>>>> +
>>>> + raw_spin_lock(&overflow_list->lock);
>>>> +retry:
>>>> + snapshot_gen = overflow_list->gen;
>>>> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
>>>> + /* Busy-wait if node is found. */
>>>> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>>>> + raw_spin_unlock(&overflow_list->lock);
>>>> + cpu_relax();
>>>
>>> I think we should prioritize the scan thread solution [2] instead of
>>> busy waiting hazrd pointer updaters, because when we have multiple
>>> hazard pointer usages we would want to consolidate the scans from
>>> updater side.
>>
>> I agree that batching scans with a worker thread is a logical next step.
>>
>>> If so, the whole ->gen can be avoided.
>>
>> How would it allow removing the generation trick without causing long
>> raw spinlock latencies ?
>>
>
> Because we won't need to busy-wait for the readers to go away, we can
> check whether they are still there in the next scan.
>
> so:
>
> list_for_each_entry(backup_slot, &overflow_list->head, node) {
> /* Busy-wait if node is found. */
> if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> <mark addr as unable to free and move on>
But then you still iterate on a possibly large list of overflow nodes,
with a raw spinlock held. That raw spinlock is taken by the scheduler
on context switch. This can cause very long scheduler latency.
So breaking up the iteration into pieces is not just to handle
busy-waiting, but also to make sure we don't increase the
system latency by holding a raw spinlock (taken with rq lock
held) for more than the little time needed to iterate to the next
node.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-18 23:36 ` Mathieu Desnoyers
@ 2025-12-19 0:25 ` Boqun Feng
2025-12-19 6:06 ` Joel Fernandes
2025-12-19 15:14 ` Mathieu Desnoyers
0 siblings, 2 replies; 29+ messages in thread
From: Boqun Feng @ 2025-12-19 0:25 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On Thu, Dec 18, 2025 at 06:36:00PM -0500, Mathieu Desnoyers wrote:
> On 2025-12-18 15:22, Boqun Feng wrote:
> [...]
> > > > Could you utilize this[1] to see a
> > > > comparison of the reader-side performance against RCU/SRCU?
> > >
> > > Good point ! Let's see.
> > >
> > > On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> > > hyperthreading disabled,
> > > CONFIG_PREEMPT=y,
> > > CONFIG_PREEMPT_RCU=y,
> > > CONFIG_PREEMPT_HAZPTR=y.
> > >
> > > scale_type ns
> > > -----------------------
> > > hazptr-smp-mb 13.1 <- this implementation
> > > hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> > > hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
> > > rcu 17.0
> > > srcu 20.0
> > > srcu-fast 1.5
> > > rcu-tasks 0.0
> > > rcu-trace 1.7
> > > refcnt 1148.0
> > > rwlock 1190.0
> > > rwsem 4199.3
> > > lock 41070.6
> > > lock-irq 46176.3
> > > acqrel 1.1
> > >
> > > So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
> > > appear to beat hazptr read-side performance.
> > >
> >
> > Could you also see the reader-side performance impact when the percpu
> > hazard pointer slots are used up? I.e. the worst case.
>
> I've modified the code to populate "(void *)1UL" in the 7 first slots
> at bootup, here is the result:
>
> hazptr-smp-mb-7-fail 16.3 ns
>
> So we go from 13.1 ns to 16.3 ns when all but one slots are used.
>
> And if we pre-populate the 8 slots for each cpu, and thus force
> fallback to overflow list:
>
> hazptr-smp-mb-8-fail 67.1 ns
>
Thank you! So involving locking seems to hurt performance more than
per-CPU/per-task operations. This may suggest that enabling
PREEMPT_HAZPTR by default has an acceptable performance.
> >
> > > [...]
> > >
> > > > > +/*
> > > > > + * Perform piecewise iteration on overflow list waiting until "addr" is
> > > > > + * not present. Raw spinlock is released and taken between each list
> > > > > + * item and busy loop iteration. The overflow list generation is checked
> > > > > + * each time the lock is taken to validate that the list has not changed
> > > > > + * before resuming iteration or busy wait. If the generation has
> > > > > + * changed, retry the entire list traversal.
> > > > > + */
> > > > > +static
> > > > > +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
> > > > > +{
> > > > > + struct hazptr_backup_slot *backup_slot;
> > > > > + uint64_t snapshot_gen;
> > > > > +
> > > > > + raw_spin_lock(&overflow_list->lock);
> > > > > +retry:
> > > > > + snapshot_gen = overflow_list->gen;
> > > > > + list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > > > > + /* Busy-wait if node is found. */
> > > > > + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > > > > + raw_spin_unlock(&overflow_list->lock);
> > > > > + cpu_relax();
> > > >
> > > > I think we should prioritize the scan thread solution [2] instead of
> > > > busy waiting hazrd pointer updaters, because when we have multiple
> > > > hazard pointer usages we would want to consolidate the scans from
> > > > updater side.
> > >
> > > I agree that batching scans with a worker thread is a logical next step.
> > >
> > > > If so, the whole ->gen can be avoided.
> > >
> > > How would it allow removing the generation trick without causing long
> > > raw spinlock latencies ?
> > >
> >
> > Because we won't need to busy-wait for the readers to go away, we can
> > check whether they are still there in the next scan.
> >
> > so:
> >
> > list_for_each_entry(backup_slot, &overflow_list->head, node) {
> > /* Busy-wait if node is found. */
> > if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
> > <mark addr as unable to free and move on>
>
> But then you still iterate on a possibly large list of overflow nodes,
> with a raw spinlock held. That raw spinlock is taken by the scheduler
> on context switch. This can cause very long scheduler latency.
>
That's fair.
> So breaking up the iteration into pieces is not just to handle
> busy-waiting, but also to make sure we don't increase the
> system latency by holding a raw spinlock (taken with rq lock
> held) for more than the little time needed to iterate to the next
> node.
>
I agree that it helps reduce the latency, but I feel like with a scan
thread in the picture (and we don't need to busy-wait), we should use
a forward-progress-guaranteed way in the updater side scan, which means
we may need to explore other solutions for the latency (e.g.
fine-grained locking hashlist for the overflow list) than the generation
counter.
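For illustration, the fine-grained variant could look roughly like this
(a very rough sketch; the bucket count, the hnode field and how entries
get distributed across buckets are all invented for the example):

    /*
     * Spread overflow entries across hash buckets, each with its own
     * raw spinlock, so neither the context-switch path nor the
     * updater scan ever holds a single lock across many entries.
     */
    #define OVERFLOW_HASH_BITS      6
    #define NR_OVERFLOW_BUCKETS     (1 << OVERFLOW_HASH_BITS)

    struct overflow_bucket {
            raw_spinlock_t          lock;
            struct hlist_head       head;
    };

    static struct overflow_bucket overflow_hash[NR_OVERFLOW_BUCKETS];

    /* Context-switch/reader side: chain one entry under one bucket lock. */
    static void overflow_chain(struct overflow_bucket *bucket, struct hlist_node *hnode)
    {
            raw_spin_lock(&bucket->lock);
            hlist_add_head(hnode, &bucket->head);
            raw_spin_unlock(&bucket->lock);
    }

    /* Updater side: walk bucket by bucket, one bucket lock at a time. */
    static bool overflow_scan(void *addr)
    {
            struct hazptr_backup_slot *backup_slot;
            bool busy = false;
            int i;

            for (i = 0; i < NR_OVERFLOW_BUCKETS; i++) {
                    struct overflow_bucket *bucket = &overflow_hash[i];

                    raw_spin_lock(&bucket->lock);
                    hlist_for_each_entry(backup_slot, &bucket->head, hnode)
                            busy |= smp_load_acquire(&backup_slot->slot.addr) == addr;
                    raw_spin_unlock(&bucket->lock);
            }
            return busy;
    }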
Regards,
Boqun
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-19 0:25 ` Boqun Feng
@ 2025-12-19 6:06 ` Joel Fernandes
2025-12-19 15:14 ` Mathieu Desnoyers
1 sibling, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2025-12-19 6:06 UTC (permalink / raw)
To: Boqun Feng
Cc: Mathieu Desnoyers, Paul E. McKenney, linux-kernel,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Will Deacon, Peter Zijlstra,
Alan Stern, John Stultz, Neeraj Upadhyay, Linus Torvalds,
Andrew Morton, Frederic Weisbecker, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm
Hi Boqun,
> On Dec 18, 2025, at 9:07 PM, Boqun Feng <boqun.feng@gmail.com> wrote:
>
> On Thu, Dec 18, 2025 at 06:36:00PM -0500, Mathieu Desnoyers wrote:
>> On 2025-12-18 15:22, Boqun Feng wrote:
>> [...]
>>>>> Could you utilize this[1] to see a
>>>>> comparison of the reader-side performance against RCU/SRCU?
>>>>
>>>> Good point ! Let's see.
>>>>
>>>> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>>>> hyperthreading disabled,
>>>> CONFIG_PREEMPT=y,
>>>> CONFIG_PREEMPT_RCU=y,
>>>> CONFIG_PREEMPT_HAZPTR=y.
>>>>
>>>> scale_type ns
>>>> -----------------------
>>>> hazptr-smp-mb 13.1 <- this implementation
>>>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
>>>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
>>>> rcu 17.0
>>>> srcu 20.0
>>>> srcu-fast 1.5
>>>> rcu-tasks 0.0
>>>> rcu-trace 1.7
>>>> refcnt 1148.0
>>>> rwlock 1190.0
>>>> rwsem 4199.3
>>>> lock 41070.6
>>>> lock-irq 46176.3
>>>> acqrel 1.1
>>>>
>>>> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
>>>> appear to beat hazptr read-side performance.
>>>>
>>>
>>> Could you also see the reader-side performance impact when the percpu
>>> hazard pointer slots are used up? I.e. the worst case.
>>
>> I've modified the code to populate "(void *)1UL" in the 7 first slots
>> at bootup, here is the result:
>>
>> hazptr-smp-mb-7-fail 16.3 ns
>>
>> So we go from 13.1 ns to 16.3 ns when all but one slots are used.
>>
>> And if we pre-populate the 8 slots for each cpu, and thus force
>> fallback to overflow list:
>>
>> hazptr-smp-mb-8-fail 67.1 ns
>>
>
> Thank you! So involving locking seems to hurt performance more than
> per-CPU/per-task operations. This may suggest that enabling
> PREEMPT_HAZPTR by default has an acceptable performance.
My impression is that we do other locking on preemption anyway, such as the blocked-tasks list for preempted RCU critical sections. So maybe that's okay. As you said, the performance is acceptable.
>>>
>>>> [...]
>>>>
>>>>>> +/*
>>>>>> + * Perform piecewise iteration on overflow list waiting until "addr" is
>>>>>> + * not present. Raw spinlock is released and taken between each list
>>>>>> + * item and busy loop iteration. The overflow list generation is checked
>>>>>> + * each time the lock is taken to validate that the list has not changed
>>>>>> + * before resuming iteration or busy wait. If the generation has
>>>>>> + * changed, retry the entire list traversal.
>>>>>> + */
>>>>>> +static
>>>>>> +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr)
>>>>>> +{
>>>>>> + struct hazptr_backup_slot *backup_slot;
>>>>>> + uint64_t snapshot_gen;
>>>>>> +
>>>>>> + raw_spin_lock(&overflow_list->lock);
>>>>>> +retry:
>>>>>> + snapshot_gen = overflow_list->gen;
>>>>>> + list_for_each_entry(backup_slot, &overflow_list->head, node) {
>>>>>> + /* Busy-wait if node is found. */
>>>>>> + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>>>>>> + raw_spin_unlock(&overflow_list->lock);
>>>>>> + cpu_relax();
>>>>>
>>>>> I think we should prioritize the scan thread solution [2] instead of
>>>>> busy waiting hazrd pointer updaters, because when we have multiple
>>>>> hazard pointer usages we would want to consolidate the scans from
>>>>> updater side.
Yeah, the scan thread idea also fixes the scan cost issue with per-task slots if we batch. If we implement a separate hazard pointer flavor along those lines, then we should definitely do a worker thread.
>>>> I agree that batching scans with a worker thread is a logical next step.
>>>>
>>>>> If so, the whole ->gen can be avoided.
>>>>
>>>> How would it allow removing the generation trick without causing long
>>>> raw spinlock latencies ?
>>>>
>>>
>>> Because we won't need to busy-wait for the readers to go away, we can
>>> check whether they are still there in the next scan.
>>>
>>> so:
>>>
>>> list_for_each_entry(backup_slot, &overflow_list->head, node) {
>>> /* Busy-wait if node is found. */
>>> if (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */
>>> <mark addr as unable to free and move on>
>>
>> But then you still iterate on a possibly large list of overflow nodes,
>> with a raw spinlock held. That raw spinlock is taken by the scheduler
>> on context switch. This can cause very long scheduler latency.
>>
>
> That's fair.
What about combining both approaches?
Can we do the generation trick along with worker thread scanning? I feel that should be doable.
>
>> So breaking up the iteration into pieces is not just to handle
>> busy-waiting, but also to make sure we don't increase the
>> system latency by holding a raw spinlock (taken with rq lock
>> held) for more than the little time needed to iterate to the next
>> node.
>>
>
> I agree that it helps reduce the latency, but I feel like with a scan
> thread in the picture (and we don't need to busy-wait), we should use
> a forward-progress-guaranteed way in the updater side scan, which means
> we may need to explore other solutions for the latency (e.g.
> fine-grained locking hashlist for the overflow list) than the generation
> counter.
Hmm.. That only works, I guess, if there is no interference between the fine-grained list being iterated and the list being overflowed into. Otherwise, I think it might run into the same issue, but I could be missing something about the idea.
thanks,
- Joel
>
> Regards,
> Boqun
>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-19 0:25 ` Boqun Feng
2025-12-19 6:06 ` Joel Fernandes
@ 2025-12-19 15:14 ` Mathieu Desnoyers
2025-12-19 15:42 ` Joel Fernandes
1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-19 15:14 UTC (permalink / raw)
To: Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 2025-12-18 19:25, Boqun Feng wrote:
[...]
>> And if we pre-populate the 8 slots for each cpu, and thus force
>> fallback to overflow list:
>>
>> hazptr-smp-mb-8-fail 67.1 ns
>>
>
> Thank you! So involving locking seems to hurt performance more than
> per-CPU/per-task operations. This may suggest that enabling
> PREEMPT_HAZPTR by default has an acceptable performance.
Indeed, I can fold it into the hazptr patch and remove the config
option then.
>
>> So breaking up the iteration into pieces is not just to handle
>> busy-waiting, but also to make sure we don't increase the
>> system latency by holding a raw spinlock (taken with rq lock
>> held) for more than the little time needed to iterate to the next
>> node.
>>
>
> I agree that it helps reduce the latency, but I feel like with a scan
> thread in the picture (and we don't need to busy-wait), we should use
> a forward-progress-guaranteed way in the updater side scan, which means
> we may need to explore other solutions for the latency (e.g.
> fine-grained locking hashlist for the overflow list) than the generation
> counter.
Guaranteeing forward progress of synchronize even with a steady stream
of unrelated hazard pointers addition/removal to/from the overflow list
is something we should aim for, with or without a scan thread.
As is, my current generation scheme does not guarantee this. But we can
use the liburcu RCU grace period "parity" concept as inspiration [1] and
introduce a two-list scheme, and have hazptr_synchronize flip the
current "list_add" head while it iterates on the other list. There would
be one generation counter for each of the two lists.
This would be protected by holding a global mutex across
hazptr_synchronize. hazptr_synchronize would need to iterate
on the two lists one after the other, carefully flipping the
current "addition list" head between the two iterations.
So the worst case that can happen in terms of retries caused by
generation counter increments is if list entries are deleted while
the list is being traversed by hazptr_synchronize. Because there
are no possible concurrent additions to that list, the worst case
is that the list becomes empty, which bounds the number of retries
to the number of list elements.
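Roughly, the synchronize side of that two-list scheme would look like
this (a minimal sketch, ignoring the per-CPU dimension; the struct and
scan_list_piecewise() stand in for the existing piecewise,
generation-checked traversal of a single list):

    struct overflow_lists {
            raw_spinlock_t          lock;
            struct list_head        head[2];
            u64                     gen[2];
            int                     add_idx;        /* List receiving new backup slots. */
    };

    static DEFINE_MUTEX(hazptr_sync_mutex);

    static void hazptr_synchronize_overflow(struct overflow_lists *ol, void *addr)
    {
            int i;

            mutex_lock(&hazptr_sync_mutex);
            for (i = 0; i < 2; i++) {
                    int scan_idx;

                    raw_spin_lock(&ol->lock);
                    scan_idx = ol->add_idx;
                    ol->add_idx = 1 - scan_idx;     /* New additions go to the other list. */
                    raw_spin_unlock(&ol->lock);
                    /* Only deletions can bump gen[scan_idx] from here on. */
                    scan_list_piecewise(ol, scan_idx, addr);
            }
            mutex_unlock(&hazptr_sync_mutex);
    }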
Thoughts ?
Thanks,
Mathieu
[1] https://github.com/urcu/userspace-rcu/blob/master/src/urcu.c#L409
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-19 15:14 ` Mathieu Desnoyers
@ 2025-12-19 15:42 ` Joel Fernandes
2025-12-19 22:19 ` Mathieu Desnoyers
0 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2025-12-19 15:42 UTC (permalink / raw)
To: Mathieu Desnoyers, Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 12/19/2025 10:14 AM, Mathieu Desnoyers wrote:
> On 2025-12-18 19:25, Boqun Feng wrote:
> [...]
>>> And if we pre-populate the 8 slots for each cpu, and thus force
>>> fallback to overflow list:
>>>
>>> hazptr-smp-mb-8-fail 67.1 ns
>>>
>>
>> Thank you! So involving locking seems to hurt performance more than
>> per-CPU/per-task operations. This may suggest that enabling
>> PREEMPT_HAZPTR by default has an acceptable performance.
>
> Indeed, I can fold it into the hazptr patch and remove the config
> option then.
>
>>
>>> So breaking up the iteration into pieces is not just to handle
>>> busy-waiting, but also to make sure we don't increase the
>>> system latency by holding a raw spinlock (taken with rq lock
>>> held) for more than the little time needed to iterate to the next
>>> node.
>>>
>>
>> I agree that it helps reduce the latency, but I feel like with a scan
>> thread in the picture (and we don't need to busy-wait), we should use
>> a forward-progress-guaranteed way in the updater side scan, which means
>> we may need to explore other solutions for the latency (e.g.
>> fine-grained locking hashlist for the overflow list) than the generation
>> counter.
>
> Guaranteeing forward progress of synchronize even with a steady stream
> of unrelated hazard pointers addition/removal to/from the overflow list
> is something we should aim for, with or without a scan thread.
>
> As is, my current generation scheme does not guarantee this. But we can
> use liburcu RCU grace period "parity" concept as inspiration [1] and
> introduce a two-lists scheme, and have hazptr_synchronize flip the
> current "list_add" head while it iterates on the other list. There would
> be one generation counter for each of the two lists.
>
> This would be protected by holding a global mutex across
> hazptr_synchronize. hazptr_synchronize would need to iterate
> on the two lists one after the other, carefully flipping the
> current "addition list" head between the two iterations.
>
> So the worse case that can happen in terms of retry caused by
> generation counter increments is if list entries are deleted while
> the list is being traversed by hazptr_synchronize. Because there
> are no possible concurrent additions to that list, the worse case
> is that the list becomes empty, which bounds the number of retry
> to the number of list elements.
>
> Thoughts ?
IMHO the overflow case is "special" and should not happen often, otherwise
things are "bad" anyway. I am not sure if this kind of complexity will be worth
it unless we know HP forward-progress is a real problem. Also, since HP acquire
will be short lived, are we that likely to not get past a temporary shortage of
slots?
Perhaps the forward-progress problem should be rephrased to the following?: If a
reader hit an overflow slot, it should probably be able to get a non-overflow
slot soon, even if hazard pointer slots are over-subscribed.
Thanks.
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-19 15:42 ` Joel Fernandes
@ 2025-12-19 22:19 ` Mathieu Desnoyers
2025-12-19 22:39 ` Joel Fernandes
0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-19 22:19 UTC (permalink / raw)
To: Joel Fernandes, Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 2025-12-19 10:42, Joel Fernandes wrote:
>
> IMHO the overflow case is "special" and should not happen often, otherwise
> things are "bad" anyway. I am not sure if this kind of complexity will be worth
> it unless we know HP forward-progress is a real problem. Also, since HP acquire
> will be short lived, are we that likely to not get past a temporary shortage of
> slots?
Given that we have context switch integration which moves the active
per-cpu slots to the overflow list, I can see that even not-so-special
users which get preempted or block regularly with active per-cpu slots
could theoretically end up preventing HP scan progress.
Providing HP scan progress guarantees is IMO an important aspect to
consider if we want to ensure the implementation does not cause subtle
HP scan stalls when its amount of use scales up.
> Perhaps the forward-progress problem should be rephrased to the following?: If a
> reader hit an overflow slot, it should probably be able to get a non-overflow
> slot soon, even if hazard pointer slots are over-subscribed.
Those are two distinct forward progress guarantees, and I think both are
important:
* Forward progress of HP readers,
* Forward progress of HP scan.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-19 22:19 ` Mathieu Desnoyers
@ 2025-12-19 22:39 ` Joel Fernandes
2025-12-21 9:59 ` Boqun Feng
0 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2025-12-19 22:39 UTC (permalink / raw)
To: Mathieu Desnoyers, Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 12/19/2025 5:19 PM, Mathieu Desnoyers wrote:
> On 2025-12-19 10:42, Joel Fernandes wrote:
>>
>> IMHO the overflow case is "special" and should not happen often, otherwise
>> things are "bad" anyway. I am not sure if this kind of complexity will be worth
>> it unless we know HP forward-progress is a real problem. Also, since HP acquire
>> will be short lived, are we that likely to not get past a temporary shortage of
>> slots?
>
> Given that we have context switch integration which moves the active
> per-cpu slots to the overflow list, I can see that even not-so-special
> users which get preempted or block regularly with active per-cpu slots
> could theoretically end up preventing HP scan progress.
Yeah, I see. That's the 'problem' with the preemption design of this patchset.
You always have to move it in and out of the overflow list on preemption even
when there is no overflow AFAICS. My idea (which has its own issues) does not
require that on preemption.
> Providing HP scan progress guarantees is IMO an important aspect to
> consider if we want to ensure the implementation does not cause subtle
> HP scan stall when its amount of use will scale up.
Sure. But if you didn't have to deal with a list in the 'normal' case (i.e. when
the slots are not over-saturated), then it wouldn't be that big a deal.
>> Perhaps the forward-progress problem should be rephrased to the following?: If a
>> reader hit an overflow slot, it should probably be able to get a non-overflow
>> slot soon, even if hazard pointer slots are over-subscribed.
>
> Those are two distinct forward progress guarantees, and I think both are
> important:
>
> * Forward progress of HP readers,
> * Forward progress of HP scan.
Maybe I am missing something, but AFAICS, if the readers are only using slots
and not locking in normal operation, then the scan will also automatically make
forward progress. So the forward progress of readers and that of scanning are
related. It is your preemption design that requires the locking.
Btw, I thought about the scanning issue with the task-slots idea, and really
synchronize() is supposed to be a slow path, so I am not fully sure it is that
much of an issue - it again depends on the use case, but for per-cpu ref and
synchronize_rcu(), both are 'slow' anyway. And for the async case, it is almost
not an issue at all due to the batching/amortization of the task scan cost.
thanks,
- Joel
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-19 22:39 ` Joel Fernandes
@ 2025-12-21 9:59 ` Boqun Feng
0 siblings, 0 replies; 29+ messages in thread
From: Boqun Feng @ 2025-12-21 9:59 UTC (permalink / raw)
To: Joel Fernandes
Cc: Mathieu Desnoyers, Joel Fernandes, Paul E. McKenney,
linux-kernel, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Peter Zijlstra, Alan Stern, John Stultz, Linus Torvalds,
Andrew Morton, Frederic Weisbecker, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, rcu, linux-mm,
lkmm
On Fri, Dec 19, 2025 at 05:39:00PM -0500, Joel Fernandes wrote:
>
>
> On 12/19/2025 5:19 PM, Mathieu Desnoyers wrote:
> > On 2025-12-19 10:42, Joel Fernandes wrote:
> >>
> >> IMHO the overflow case is "special" and should not happen often, otherwise
> >> things are "bad" anyway. I am not sure if this kind of complexity will be worth
> >> it unless we know HP forward-progress is a real problem. Also, since HP acquire
> >> will be short lived, are we that likely to not get past a temporary shortage of
> >> slots?
> >
> > Given that we have context switch integration which moves the active
> > per-cpu slots to the overflow list, I can see that even not-so-special
> > users which get preempted or block regularly with active per-cpu slots
> > could theoretically end up preventing HP scan progress.
>
> Yeah, I see. That's the 'problem' with the preemption design of this patchset.
> You always have to move it in and out of overflow list on preemption even when
> there is no overflow AFAICS. My idea (which has its own issues) does not require
> that on preemption.
>
> > Providing HP scan progress guarantees is IMO an important aspect to
> > consider if we want to ensure the implementation does not cause subtle
> > HP scan stall when its amount of use will scale up.
>
> Sure. But if you didn't have to deal with a list in the 'normal' case (not over
> saturated slots case), then it wouldn't be that big a deal.
>
With PREEMPT_HAZPTR, the overflow list will be used if a task gets
preempted while using hazptrs; without PREEMPT_HAZPTR, the overflow list
will be used if there are more than 8 slots used on the CPU (most likely
due to a task being preempted with a hazptr slot in use). Therefore,
if this hazptr design is to have a few users and we want to support
preemptible hazard pointer read-side sections, sooner or later we need
to resolve this issue of the scan racing with readers (in context
switches) on the overflow list.
Surely we don't need to have something perfect on day 1, but planning a
bit ahead won't hurt ;-)
Regards,
Boqun
> >> Perhaps the forward-progress problem should be rephrased to the following?: If a
> >> reader hit an overflow slot, it should probably be able to get a non-overflow
> >> slot soon, even if hazard pointer slots are over-subscribed.
> >
> > Those are two distinct forward progress guarantees, and I think both are
> > important:
> >
> > * Forward progress of HP readers,
> > * Forward progress of HP scan.
>
[..]
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-18 17:35 ` Mathieu Desnoyers
2025-12-18 20:22 ` Boqun Feng
@ 2025-12-19 0:43 ` Boqun Feng
2025-12-19 14:22 ` Mathieu Desnoyers
1 sibling, 1 reply; 29+ messages in thread
From: Boqun Feng @ 2025-12-19 0:43 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
[...]
> > Could you utilize this[1] to see a
> > comparison of the reader-side performance against RCU/SRCU?
>
> Good point ! Let's see.
>
> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
> hyperthreading disabled,
> CONFIG_PREEMPT=y,
> CONFIG_PREEMPT_RCU=y,
> CONFIG_PREEMPT_HAZPTR=y.
>
> scale_type ns
> -----------------------
> hazptr-smp-mb 13.1 <- this implementation
> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
> rcu 17.0
Hmm.. now looking back, how is it possible that hazptr is faster than
RCU on the reader-side? Because a grace period was happening and
triggered rcu_read_unlock_special()? This is actually more interesting.
Regards,
Boqun
> srcu 20.0
> srcu-fast 1.5
> rcu-tasks 0.0
> rcu-trace 1.7
> refcnt 1148.0
> rwlock 1190.0
> rwsem 4199.3
> lock 41070.6
> lock-irq 46176.3
> acqrel 1.1
>
> So only srcu-fast, rcu-tasks, rcu-trace and a plain acqrel
> appear to beat hazptr read-side performance.
>
> [...]
>
[...]
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-19 0:43 ` Boqun Feng
@ 2025-12-19 14:22 ` Mathieu Desnoyers
0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-19 14:22 UTC (permalink / raw)
To: Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 2025-12-18 19:43, Boqun Feng wrote:
> On Thu, Dec 18, 2025 at 12:35:18PM -0500, Mathieu Desnoyers wrote:
> [...]
>>> Could you utilize this[1] to see a
>>> comparison of the reader-side performance against RCU/SRCU?
>>
>> Good point ! Let's see.
>>
>> On a AMD 2x EPYC 9654 96-Core Processor with 192 cores,
>> hyperthreading disabled,
>> CONFIG_PREEMPT=y,
>> CONFIG_PREEMPT_RCU=y,
>> CONFIG_PREEMPT_HAZPTR=y.
>>
>> scale_type ns
>> -----------------------
>> hazptr-smp-mb 13.1 <- this implementation
>> hazptr-barrier 11.5 <- replace smp_mb() on acquire with barrier(), requires IPIs on synchronize.
>> hazptr-smp-mb-hlist 12.7 <- replace per-task hp context and per-cpu overflow lists by hlist.
>> rcu 17.0
>
> Hmm.. now looking back, how is it possible that hazptr is faster than
> RCU on the reader-side? Because a grace period was happening and
> triggered rcu_read_unlock_special()? This is actualy more interesting.
So I could be entirely misreading the code, but, we have:
rcu_flavor_sched_clock_irq():
[...]
/* If GP is oldish, ask for help from rcu_read_unlock_special(). */
if (rcu_preempt_depth() > 0 &&
__this_cpu_read(rcu_data.core_needs_qs) &&
__this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
!t->rcu_read_unlock_special.b.need_qs &&
time_after(jiffies, rcu_state.gp_start + HZ))
t->rcu_read_unlock_special.b.need_qs = true;
which means we set need_qs = true as a result of observing
cpu_no_qs.b.norm == true.
This is sufficient to trigger calls (plural) to rcu_read_unlock_special()
from __rcu_read_unlock.
But then if we look at rcu_preempt_deferred_qs_irqrestore()
which we would expect to clear the rcu_read_unlock_special.b.need_qs
state, we have this:
special = t->rcu_read_unlock_special;
if (!special.s && !rdp->cpu_no_qs.b.exp) {
local_irq_restore(flags);
return;
}
t->rcu_read_unlock_special.s = 0;
which skips over clearing the state unless there is an expedited
grace period required.
So unless I'm missing something, we should _also_ clear that state
when it's invoked after rcu_flavor_sched_clock_irq, so that subsequent
__rcu_read_unlock calls won't all go into rcu_read_unlock_special().
I'm adding a big warning about sleep deprivation and possibly
misunderstanding the whole thing. What am I missing ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
2025-12-18 1:45 ` [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers Mathieu Desnoyers
2025-12-18 8:36 ` Boqun Feng
@ 2025-12-19 1:22 ` Joel Fernandes
1 sibling, 0 replies; 29+ messages in thread
From: Joel Fernandes @ 2025-12-19 1:22 UTC (permalink / raw)
To: Mathieu Desnoyers, Boqun Feng, Joel Fernandes, Paul E. McKenney
Cc: linux-kernel, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Peter Zijlstra, Alan Stern, John Stultz, Neeraj Upadhyay,
Linus Torvalds, Andrew Morton, Frederic Weisbecker,
Josh Triplett, Uladzislau Rezki, Steven Rostedt, Lai Jiangshan,
Zqiang, Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm
On 12/18/2025 12:54 PM, Mathieu Desnoyers wrote:
[..]
>>> <mathieu.desnoyers@efficios.com> wrote:
[..]
>>> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
>>> if you guys have time to try it out with your use-cases it would be
>>> great!
>>>
>>> This new version does the following:
>>>
>>> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
>>> - The hazard pointer user allocates a hazard pointer context variable
>>> (typically on the stack), which contains the pointer to the slot *and*
>>> a backup slot,
>>> - When all the per-CPU slots are in use, fallback to the backup slot.
>>> Chain the backup slot into per-CPU lists, each protected by a raw
>>> spinlock.
>>> - The hazard pointer synchronize does a piecewise iteration on the
>>> per-CPU overflow slots lists, releasing the raw spinlock between
>>> each list item. It uses a 64-bit generation counter to check for
>>> concurrent list changes, and restart the traversal on generation
>>> counter mismatch.
>>> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
>>> the hazard pointer acquire/release adds and then removes the hazard
>>> pointer context from a per-task linked list. On context switch, the
>>> scheduler migrates the per-CPU slots used by the task to the backup
>>> per-context slots, thus making sure the per-CPU slots are not used
>>> by preempted and blocked tasks.
>>
>> This last point is another reason why I want the slots to be per task instead
>> of per CPU. It becomes very natural because the hazard pointer is always
>> associated with a task only anyway, not with the CPU (at usecase level). By
>> putting the slot in the task struct, we allow these requirements to flow
>> naturally without requiring any locking or list management.. Did I miss
>> something about the use cases?
>
> I start from the hypothesis that scanning all tasks at synchronize
> is likely too slow for a general purpose synchronization algo.
>
> This is why we have RCU (general purpose) tracking hierarchical per-cpu
> state, and specialized RCU flavors (for tracing and other use-cases)
> using iteration on all tasks.
>
I wouldn't call HP "general", personally; it could be (should be?) use-case
specific. That's why I am not opposed to the per-cpu approach, but I am not
convinced that there is no room for a per-task slot approach, which is simpler
in many respects in the code.
If a use case has 1000s of CPUs but few tasks running, then per-task would be way
faster. It would be even cheaper than per-cpu slots on the read side:
1. Relatively speaking, it is unlikely to overflow, since plenty of slots are
distributed across tasks, thus avoiding locking.
2. Preemption also doesn't require saving/restoring slots or locking.
I would look at per-task HP as a more specialized HP implementation, if you will,
and per-cpu as leaning more towards "general". There are tradeoffs of course.
> One of the highlights of hazard pointers over RCU is its ability to
> know about quiescence of a pointer without waiting for a whole grace
> period. Therefore I went down the route of scanning per-CPU state rather
> than per-task.
Sure, but there could be room for both flavors, no?
>> I did some measurements about the task-scanning issue, and it is fast in my
>> testing (~1ms/10000 tasks). Any input from you or anyone on what the typical
>> task count distribution is that we are addressing? I also made a rough
>> prototype, and it appears to be simpler with fewer lines of code because I do
>> not need to handle preemption. It just happens naturally.
>
> It really depends on the access pattern here. If you want to replace
> refcount decrement and test with hazard pointer synchronize, a delay of
> 1ms on each hazptr synchronize is a *lot*.
Yeah, true, but I don't think *that* is a fair comparison. The slow path is always
going to be slower than an atomic refcount no matter which HP approach we talk about.
I think percpu_ref_kill() is a better comparison, since it triggers a GP:
percpu_ref_kill()
- percpu_ref_kill_and_confirm()
...
- call_rcu_hurry(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu)
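For reference, the pattern I'm comparing against looks roughly like this (a
minimal sketch; obj_release(), obj_init_ref() and obj_teardown() are made-up
names, not from any tree):

#include <linux/percpu-refcount.h>

/* Made-up release callback: would free the object embedding @ref. */
static void obj_release(struct percpu_ref *ref)
{
}

static int obj_init_ref(struct percpu_ref *ref)
{
	return percpu_ref_init(ref, obj_release, 0, GFP_KERNEL);
}

static void obj_teardown(struct percpu_ref *ref)
{
	/*
	 * Switches the ref to atomic mode; the switch is deferred behind
	 * an RCU grace period via the call_rcu_hurry() shown above, which
	 * is the update-side latency HP synchronize should be compared to.
	 */
	percpu_ref_kill(ref);
}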
> Of course perhaps this can be batched with a worker thread, but then
> you are back to using more memory due to delayed reclaim, and thus
> closer to RCU.
If you have millions of objects with per-cpu ref counts, that can go into the
megabytes. With HP, you can negate that. So even with a batched worker thread,
there can be space savings and a fast read side (at the expense of slightly more
overhead on the write side). Sure, the lockdep use case does not benefit since
Boqun wants to speed up the write side in that case (synchronize_rcu_expedited),
but we probably don't want to discount other potential HP use cases which want a
faster, simpler read side while having preemptability, small structures, etc.,
and don't care about update-side performance..
>> First of all, we can have a per-task counter that tracks how many hazard
>> pointers are active. If this is zero, then we can simply skip the task instead
>> of wasting cycles scanning all the task slot.
>> Further, we can have a retire list that reuses a single scan to scan all the
>> objects in the retire list, thus reusing the scan cost. This can also assist
>> in asynchronously implementing object retiring via a dedicated thread perhaps
>> with the tasks RCU infrastructure. We can also make this per-task counter a
>> bitmap to speed up scanning potentially.
>>
>> I am okay with the concept of an overflow list, but if we keep the overflow
>> list at the per-task level instead of the per-CPU level, it is highly unlikely
>> IMO that such an overflow list will be used unless more than, say, eight
>> hazard pointers per task are active at any given time.
>
> The PREEMPT_HAZPTR scheme achieves this as well, with per-CPU scan
> rather than per-task.
Sure, but with additional complexity and locking that the per-task slot approach
does not have at all.
>
>> So its lock contention would be rarer than, say, having a per-CPU overflow
>> list. I would say that contention would be incredibly rare because typically
>> hazard pointers are used by multiple tasks, each of which will have its own
>> unique set of slots. Whereas in a per-CPU overflow approach, we have a higher
>> chance of lock contention, especially when the number of CPUs is low.
>
> Contention on the per-CPU spinlocks would only happen if threads are
> migrated between chaining of their backup slot (either if 8 hazptr
> are in use by the thread, or on context switch), and release. Do
> you expect this type of migration to happen so often as to increase
> contention ?
Is that really true? You also have contention when you exhaust all per-cpu slots,
not just on migration:
+ /*
+ * If all the per-CPU slots are already in use, fallback
+ * to the backup slot.
+ */
+ if (unlikely(!slot))
+ slot = hazptr_chain_backup_slot(ctx);
>
>>
>> Other than the task-scanning performance issue, what am I missing?
>
> Task-scan performance is the key issue. Also you have to set a limit
> of per-task slots. How would you handle the scenario where a user
> needs more slots than the limit ?
Sure, there is a limit, but I wager that this limit is harder to hit than in the
per-cpu case. If you have a small number of CPUs, you might be hitting this all
the time with the per-cpu approach. I'd have an overflow node in task_struct to
handle this. Billions of Linux systems have 8-16 CPUs.
>> Another nice benefit of using per-task hazard pointers is that we can also
>> implement sleeping in hazard pointer sections because we will be scanning for
>> sleeping tasks as well.
>
> The PREEMPT_HAZPTR scheme takes care of this as well. Sleeping with a
> hazptr held will trigger the context switch, and the scheduler will
> simply move the hazard pointer from the per-CPU slot to the backup
> slot. End of story.
Yeah, but the per-task approach does not need that at all. In fact the hazard
pointer context structure is smaller.
>> By contrast, the other approaches I have seen with per-CPU hazard pointers
>> forbid sleeping, since after sleeping a task is no longer associated with its
>> CPU. The other approaches also have a higher likelihood of locking due to
>> running out of slots.
>
> Except for PREEMPT_HAZPTR. :)
Ah, you mean because you make a copy of the slots into the task. But why not
just keep it in the task to begin with :-D (your point on task-scan latency is
valid though..).
>> Of course I am missing a use case, but I suspect we can find a per-CPU ref-
>> count use case that benefits from this. I am researching use cases when I get
>> time. I think my next task is to find a solid use case for this
>> before doing further development of a solution..
>
> Let me know what you find, I'm very interested!
Sure :) will do, thanks.
>> By the way, feedback on the scanning patch:
>>
>> Can you consider using a per-CPU counter to track the number of active slots
>> per CPU? That way you can ignore CPU slots for CPUs that are not using hazard
>> pointers. Another idea is to skip idle CPUs as well.
>
> I had this in a userspace prototype: a counter of used per-cpu slots.
> But I finally decided against it because it requires extra instructions
> on the fast path, *and* scanning 8 pointers within a single cache line
> on synchronize is really cheap. So I'd need benchmarks to justify adding
> back the counter.
Interesting, makes sense.
>> Have you also considered any asynchronous use case where maintaining a
>> retired list would assist in RCU-style deferred reclaim of hazard-pointer
>> objects?
>
> This would be a logical next step indeed. I'll have to think
> about this some more.
>
You might be able to re-use some of the RCU async processing machinery for that,
or maybe use timers/workqueues.
Thanks.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC PATCH v4 4/4] hazptr: Migrate per-CPU slots to backup slot on context switch
2025-12-18 1:45 [RFC PATCH v4 0/4] Hazard Pointers Mathieu Desnoyers
` (2 preceding siblings ...)
2025-12-18 1:45 ` [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers Mathieu Desnoyers
@ 2025-12-18 1:45 ` Mathieu Desnoyers
2025-12-18 16:20 ` Mathieu Desnoyers
2025-12-18 22:16 ` Boqun Feng
2025-12-18 10:33 ` [RFC PATCH v4 0/4] Hazard Pointers Joel Fernandes
4 siblings, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 1:45 UTC (permalink / raw)
To: Boqun Feng, Joel Fernandes, Paul E. McKenney
Cc: linux-kernel, Mathieu Desnoyers, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
Integrate with the scheduler to migrate per-CPU slots to the backup slot
on context switch. This ensures that the per-CPU slots won't be used by
blocked or preempted tasks holding on to hazard pointers for a long time.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
---
include/linux/hazptr.h | 63 ++++++++++++++++++++++++++++++++++++++++--
include/linux/sched.h | 4 +++
init/init_task.c | 3 ++
kernel/Kconfig.preempt | 10 +++++++
kernel/fork.c | 3 ++
kernel/sched/core.c | 2 ++
6 files changed, 83 insertions(+), 2 deletions(-)
diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
index 70c066ddb0f5..10ac53a42a7a 100644
--- a/include/linux/hazptr.h
+++ b/include/linux/hazptr.h
@@ -24,6 +24,7 @@
#include <linux/percpu.h>
#include <linux/types.h>
#include <linux/cleanup.h>
+#include <linux/sched.h>
/* 8 slots (each sizeof(void *)) fit in a single cache line. */
#define NR_HAZPTR_PERCPU_SLOTS 8
@@ -46,6 +47,9 @@ struct hazptr_ctx {
struct hazptr_slot *slot;
/* Backup slot in case all per-CPU slots are used. */
struct hazptr_backup_slot backup_slot;
+#ifdef CONFIG_PREEMPT_HAZPTR
+ struct list_head preempt_node;
+#endif
};
struct hazptr_percpu_slots {
@@ -98,6 +102,50 @@ bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot)
return slot == &ctx->backup_slot.slot;
}
+#ifdef CONFIG_PREEMPT_HAZPTR
+static inline
+void hazptr_chain_task_ctx(struct hazptr_ctx *ctx)
+{
+ list_add(&ctx->preempt_node, &current->hazptr_ctx_list);
+}
+
+static inline
+void hazptr_unchain_task_ctx(struct hazptr_ctx *ctx)
+{
+ list_del(&ctx->preempt_node);
+}
+
+static inline
+void hazptr_note_context_switch(void)
+{
+ struct hazptr_ctx *ctx;
+
+ list_for_each_entry(ctx, &current->hazptr_ctx_list, preempt_node) {
+ struct hazptr_slot *slot;
+
+ if (hazptr_slot_is_backup(ctx, ctx->slot))
+ continue;
+ slot = hazptr_chain_backup_slot(ctx);
+ /*
+ * Move hazard pointer from per-CPU slot to backup slot.
+ * This requires hazard pointer synchronize to iterate
+ * on per-CPU slots with load-acquire before iterating
+ * on the overflow list.
+ */
+ WRITE_ONCE(slot->addr, ctx->slot->addr);
+ /*
+ * store-release orders store to backup slot addr before
+ * store to per-CPU slot addr.
+ */
+ smp_store_release(&ctx->slot->addr, NULL);
+ }
+}
+#else
+static inline void hazptr_chain_task_ctx(struct hazptr_ctx *ctx) { }
+static inline void hazptr_unchain_task_ctx(struct hazptr_ctx *ctx) { }
+static inline void hazptr_note_context_switch(void) { }
+#endif
+
/*
* hazptr_acquire: Load pointer at address and protect with hazard pointer.
*
@@ -114,6 +162,7 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
struct hazptr_slot *slot = NULL;
void *addr, *addr2;
+ ctx->slot = NULL;
/*
* Load @addr_p to know which address should be protected.
*/
@@ -121,7 +170,9 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
for (;;) {
if (!addr)
return NULL;
+
guard(preempt)();
+ hazptr_chain_task_ctx(ctx);
if (likely(!hazptr_slot_is_backup(ctx, slot))) {
slot = hazptr_get_free_percpu_slot();
/*
@@ -140,8 +191,11 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
* Re-load @addr_p after storing it to the hazard pointer slot.
*/
addr2 = READ_ONCE(*addr_p); /* Load A */
- if (likely(ptr_eq(addr2, addr)))
+ if (likely(ptr_eq(addr2, addr))) {
+ ctx->slot = slot;
+ /* Success. Break loop, enable preemption and return. */
break;
+ }
/*
* If @addr_p content has changed since the first load,
* release the hazard pointer and try again.
@@ -150,11 +204,14 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
if (!addr2) {
if (hazptr_slot_is_backup(ctx, slot))
hazptr_unchain_backup_slot(ctx);
+ hazptr_unchain_task_ctx(ctx);
+ /* Loaded NULL. Enable preemption and return NULL. */
return NULL;
}
addr = addr2;
+ hazptr_unchain_task_ctx(ctx);
+ /* Enable preemption and retry. */
}
- ctx->slot = slot;
/*
* Use addr2 loaded from the second READ_ONCE() to preserve
* address dependency ordering.
@@ -170,11 +227,13 @@ void hazptr_release(struct hazptr_ctx *ctx, void *addr)
if (!addr)
return;
+ guard(preempt)();
slot = ctx->slot;
WARN_ON_ONCE(slot->addr != addr);
smp_store_release(&slot->addr, NULL);
if (unlikely(hazptr_slot_is_backup(ctx, slot)))
hazptr_unchain_backup_slot(ctx);
+ hazptr_unchain_task_ctx(ctx);
}
void hazptr_init(void);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878de25c..bbec9fd6b163 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -933,6 +933,10 @@ struct task_struct {
struct rcu_node *rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */
+#ifdef CONFIG_PREEMPT_HAZPTR
+ struct list_head hazptr_ctx_list;
+#endif
+
#ifdef CONFIG_TASKS_RCU
unsigned long rcu_tasks_nvcsw;
u8 rcu_tasks_holdout;
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206f..117aebf5573a 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -160,6 +160,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
.rcu_blocked_node = NULL,
#endif
+#ifdef CONFIG_PREEMPT_HAZPTR
+ .hazptr_ctx_list = LIST_HEAD_INIT(init_task.hazptr_ctx_list),
+#endif
#ifdef CONFIG_TASKS_RCU
.rcu_tasks_holdout = false,
.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index da326800c1c9..beb351b42b7c 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -189,3 +189,13 @@ config SCHED_CLASS_EXT
For more information:
Documentation/scheduler/sched-ext.rst
https://github.com/sched-ext/scx
+
+config PREEMPT_HAZPTR
+ bool "Move Hazard Pointers to Task Slots on Context Switch"
+ help
+ Integrate hazard pointers with the scheduler so that active
+ hazard pointers using preallocated per-CPU slots are moved to
+ their context-local slot on context switch. This prevents
+ blocked or preempted tasks from holding on to per-CPU slots
+ for a long time, which would cause higher overhead for short
+ hazard pointer critical sections.
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..35c810fe744e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1780,6 +1780,9 @@ static inline void rcu_copy_process(struct task_struct *p)
p->rcu_blocked_node = NULL;
INIT_LIST_HEAD(&p->rcu_node_entry);
#endif /* #ifdef CONFIG_PREEMPT_RCU */
+#ifdef CONFIG_PREEMPT_HAZPTR
+ INIT_LIST_HEAD(&p->hazptr_ctx_list);
+#endif /* #ifdef CONFIG_PREEMPT_HAZPTR */
#ifdef CONFIG_TASKS_RCU
p->rcu_tasks_holdout = false;
INIT_LIST_HEAD(&p->rcu_tasks_holdout_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f754a60de848..ac8bf2708140 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -60,6 +60,7 @@
#include <linux/profile.h>
#include <linux/psi.h>
#include <linux/rcuwait_api.h>
+#include <linux/hazptr.h>
#include <linux/rseq.h>
#include <linux/sched/wake_q.h>
#include <linux/scs.h>
@@ -6812,6 +6813,7 @@ static void __sched notrace __schedule(int sched_mode)
local_irq_disable();
rcu_note_context_switch(preempt);
+ hazptr_note_context_switch();
/*
* Make sure that signal_pending_state()->signal_pending() below
--
2.39.5
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v4 4/4] hazptr: Migrate per-CPU slots to backup slot on context switch
2025-12-18 1:45 ` [RFC PATCH v4 4/4] hazptr: Migrate per-CPU slots to backup slot on context switch Mathieu Desnoyers
@ 2025-12-18 16:20 ` Mathieu Desnoyers
2025-12-18 22:16 ` Boqun Feng
1 sibling, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 16:20 UTC (permalink / raw)
To: Boqun Feng, Joel Fernandes, Paul E. McKenney
Cc: linux-kernel, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Peter Zijlstra, Alan Stern, John Stultz, Neeraj Upadhyay,
Linus Torvalds, Andrew Morton, Frederic Weisbecker,
Josh Triplett, Uladzislau Rezki, Steven Rostedt, Lai Jiangshan,
Zqiang, Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm
On 2025-12-17 20:45, Mathieu Desnoyers wrote:
[...]
> +static inline
> +void hazptr_note_context_switch(void)
> +{
> + struct hazptr_ctx *ctx;
> +
> + list_for_each_entry(ctx, &current->hazptr_ctx_list, preempt_node) {
> + struct hazptr_slot *slot;
> +
> + if (hazptr_slot_is_backup(ctx, ctx->slot))
> + continue;
> + slot = hazptr_chain_backup_slot(ctx);
> + /*
> + * Move hazard pointer from per-CPU slot to backup slot.
> + * This requires hazard pointer synchronize to iterate
> + * on per-CPU slots with load-acquire before iterating
> + * on the overflow list.
> + */
> + WRITE_ONCE(slot->addr, ctx->slot->addr);
> + /*
> + * store-release orders store to backup slot addr before
> + * store to per-CPU slot addr.
> + */
> + smp_store_release(&ctx->slot->addr, NULL);
I missed this:
+ /* Use the backup slot for context. */
+ ctx->slot = slot;
Missing it triggered the hazptr_release() WARN_ON_ONCE() within modified
refscale tests.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v4 4/4] hazptr: Migrate per-CPU slots to backup slot on context switch
2025-12-18 1:45 ` [RFC PATCH v4 4/4] hazptr: Migrate per-CPU slots to backup slot on context switch Mathieu Desnoyers
2025-12-18 16:20 ` Mathieu Desnoyers
@ 2025-12-18 22:16 ` Boqun Feng
2025-12-19 0:21 ` Mathieu Desnoyers
1 sibling, 1 reply; 29+ messages in thread
From: Boqun Feng @ 2025-12-18 22:16 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On Wed, Dec 17, 2025 at 08:45:31PM -0500, Mathieu Desnoyers wrote:
> Integrate with the scheduler to migrate per-CPU slots to the backup slot
> on context switch. This ensures that the per-CPU slots won't be used by
> blocked or preempted tasks holding on to hazard pointers for a long time.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: John Stultz <jstultz@google.com>
> Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Zqiang <qiang.zhang1211@gmail.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: maged.michael@gmail.com
> Cc: Mateusz Guzik <mjguzik@gmail.com>
> Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> Cc: rcu@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: lkmm@lists.linux.dev
> ---
> include/linux/hazptr.h | 63 ++++++++++++++++++++++++++++++++++++++++--
> include/linux/sched.h | 4 +++
> init/init_task.c | 3 ++
> kernel/Kconfig.preempt | 10 +++++++
> kernel/fork.c | 3 ++
> kernel/sched/core.c | 2 ++
> 6 files changed, 83 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h
> index 70c066ddb0f5..10ac53a42a7a 100644
> --- a/include/linux/hazptr.h
> +++ b/include/linux/hazptr.h
> @@ -24,6 +24,7 @@
> #include <linux/percpu.h>
> #include <linux/types.h>
> #include <linux/cleanup.h>
> +#include <linux/sched.h>
>
> /* 8 slots (each sizeof(void *)) fit in a single cache line. */
> #define NR_HAZPTR_PERCPU_SLOTS 8
> @@ -46,6 +47,9 @@ struct hazptr_ctx {
> struct hazptr_slot *slot;
> /* Backup slot in case all per-CPU slots are used. */
> struct hazptr_backup_slot backup_slot;
> +#ifdef CONFIG_PREEMPT_HAZPTR
I would suggest we make the PREEMPT_HAZPTR behavior unconditional, hence no
need for a config option. Do we have measurements of the additional cost?
> + struct list_head preempt_node;
> +#endif
> };
>
> struct hazptr_percpu_slots {
> @@ -98,6 +102,50 @@ bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot)
> return slot == &ctx->backup_slot.slot;
> }
>
> +#ifdef CONFIG_PREEMPT_HAZPTR
> +static inline
> +void hazptr_chain_task_ctx(struct hazptr_ctx *ctx)
> +{
> + list_add(&ctx->preempt_node, &current->hazptr_ctx_list);
> +}
> +
> +static inline
> +void hazptr_unchain_task_ctx(struct hazptr_ctx *ctx)
> +{
> + list_del(&ctx->preempt_node);
> +}
> +
I think you need to add interrupt disabling for chain/unchain because of
potential readers in interrupt context, and then I think you can avoid the
preempt disabling in hazptr_release(). Let's aim for supporting readers in
interrupt handlers, because at least lockdep needs that.
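Something along these lines, as an untested sketch (only the locking around
the existing list manipulation changes):

static inline
void hazptr_chain_task_ctx(struct hazptr_ctx *ctx)
{
	unsigned long flags;

	/* Exclude hazptr readers running from interrupt context. */
	local_irq_save(flags);
	list_add(&ctx->preempt_node, &current->hazptr_ctx_list);
	local_irq_restore(flags);
}

static inline
void hazptr_unchain_task_ctx(struct hazptr_ctx *ctx)
{
	unsigned long flags;

	local_irq_save(flags);
	list_del(&ctx->preempt_node);
	local_irq_restore(flags);
}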
Regards,
Boqun
> +static inline
> +void hazptr_note_context_switch(void)
> +{
> + struct hazptr_ctx *ctx;
> +
> + list_for_each_entry(ctx, &current->hazptr_ctx_list, preempt_node) {
> + struct hazptr_slot *slot;
> +
> + if (hazptr_slot_is_backup(ctx, ctx->slot))
> + continue;
> + slot = hazptr_chain_backup_slot(ctx);
> + /*
> + * Move hazard pointer from per-CPU slot to backup slot.
> + * This requires hazard pointer synchronize to iterate
> + * on per-CPU slots with load-acquire before iterating
> + * on the overflow list.
> + */
> + WRITE_ONCE(slot->addr, ctx->slot->addr);
> + /*
> + * store-release orders store to backup slot addr before
> + * store to per-CPU slot addr.
> + */
> + smp_store_release(&ctx->slot->addr, NULL);
> + }
> +}
> +#else
> +static inline void hazptr_chain_task_ctx(struct hazptr_ctx *ctx) { }
> +static inline void hazptr_unchain_task_ctx(struct hazptr_ctx *ctx) { }
> +static inline void hazptr_note_context_switch(void) { }
> +#endif
> +
> /*
> * hazptr_acquire: Load pointer at address and protect with hazard pointer.
> *
> @@ -114,6 +162,7 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
> struct hazptr_slot *slot = NULL;
> void *addr, *addr2;
>
> + ctx->slot = NULL;
> /*
> * Load @addr_p to know which address should be protected.
> */
> @@ -121,7 +170,9 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
> for (;;) {
> if (!addr)
> return NULL;
> +
> guard(preempt)();
> + hazptr_chain_task_ctx(ctx);
> if (likely(!hazptr_slot_is_backup(ctx, slot))) {
> slot = hazptr_get_free_percpu_slot();
> /*
> @@ -140,8 +191,11 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
> * Re-load @addr_p after storing it to the hazard pointer slot.
> */
> addr2 = READ_ONCE(*addr_p); /* Load A */
> - if (likely(ptr_eq(addr2, addr)))
> + if (likely(ptr_eq(addr2, addr))) {
> + ctx->slot = slot;
> + /* Success. Break loop, enable preemption and return. */
> break;
> + }
> /*
> * If @addr_p content has changed since the first load,
> * release the hazard pointer and try again.
> @@ -150,11 +204,14 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p)
> if (!addr2) {
> if (hazptr_slot_is_backup(ctx, slot))
> hazptr_unchain_backup_slot(ctx);
> + hazptr_unchain_task_ctx(ctx);
> + /* Loaded NULL. Enable preemption and return NULL. */
> return NULL;
> }
> addr = addr2;
> + hazptr_unchain_task_ctx(ctx);
> + /* Enable preemption and retry. */
> }
> - ctx->slot = slot;
> /*
> * Use addr2 loaded from the second READ_ONCE() to preserve
> * address dependency ordering.
> @@ -170,11 +227,13 @@ void hazptr_release(struct hazptr_ctx *ctx, void *addr)
>
> if (!addr)
> return;
> + guard(preempt)();
> slot = ctx->slot;
> WARN_ON_ONCE(slot->addr != addr);
> smp_store_release(&slot->addr, NULL);
> if (unlikely(hazptr_slot_is_backup(ctx, slot)))
> hazptr_unchain_backup_slot(ctx);
> + hazptr_unchain_task_ctx(ctx);
> }
[...]
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v4 4/4] hazptr: Migrate per-CPU slots to backup slot on context switch
2025-12-18 22:16 ` Boqun Feng
@ 2025-12-19 0:21 ` Mathieu Desnoyers
0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-19 0:21 UTC (permalink / raw)
To: Boqun Feng
Cc: Joel Fernandes, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 2025-12-18 17:16, Boqun Feng wrote:
[...]
>
> I would suggest we make the PREEMPT_HAZPTR behavior unconditional, hence no
> need for a config option. Do we have measurements of the additional cost?
Removing PREEMPT_HAZPTR brings read-side performance
from 13.1 ns down to 5.1 ns, so there is a surprising amount
of work that goes into list_add/list_del.
I just noticed that I was running a kernel with CONFIG_LIST_HARDENED=y.
Rerunning refscale for PREEMPT_HAZPTR=y with list hardening disabled
goes from 13.1 ns to 12.4 ns, so not a huge win.
I did not notice much difference in terms of scheduler
performance with a quick hackbench run, but I cannot
claim it is an extensive benchmark in any way.
>
> I think you need to add interrupt disabling for chain/unchain because of
> potential readers in interrupt context, and then I think you can avoid the
> preempt disabling in hazptr_release(). Let's aim for supporting readers in
> interrupt handlers, because at least lockdep needs that.
OK, I'll look into it!
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v4 0/4] Hazard Pointers
2025-12-18 1:45 [RFC PATCH v4 0/4] Hazard Pointers Mathieu Desnoyers
` (3 preceding siblings ...)
2025-12-18 1:45 ` [RFC PATCH v4 4/4] hazptr: Migrate per-CPU slots to backup slot on context switch Mathieu Desnoyers
@ 2025-12-18 10:33 ` Joel Fernandes
2025-12-18 17:54 ` Mathieu Desnoyers
4 siblings, 1 reply; 29+ messages in thread
From: Joel Fernandes @ 2025-12-18 10:33 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm,
Mathieu Desnoyers
Hi Mathieu,
Thanks for posting this.
> On Dec 17, 2025, at 8:45 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> Hi,
>
> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
> if you guys have time to try it out with your use-cases it would be
> great!
>
> This new version does the following:
>
> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
> - The hazard pointer user allocates a hazard pointer context variable
> (typically on the stack), which contains the pointer to the slot *and*
> a backup slot,
> - When all the per-CPU slots are in use, fallback to the backup slot.
> Chain the backup slot into per-CPU lists, each protected by a raw
> spinlock.
> - The hazard pointer synchronize does a piecewise iteration on the
> per-CPU overflow slots lists, releasing the raw spinlock between
> each list item. It uses a 64-bit generation counter to check for
> concurrent list changes, and restart the traversal on generation
> counter mismatch.
> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
> the hazard pointer acquire/release adds and then removes the hazard
> pointer context from a per-task linked list. On context switch, the
> scheduler migrates the per-CPU slots used by the task to the backup
> per-context slots, thus making sure the per-CPU slots are not used
> by preempted and blocked tasks.
This last point is another reason why I want the slots to be per task instead of per CPU. It becomes very natural because the hazard pointer is always associated with a task only anyway, not with the CPU (at usecase level). By putting the slot in the task struct, we allow these requirements to flow naturally without requiring any locking or list management.. Did I miss something about the use cases?
I did some measurements about the task-scanning issue, and it is fast in my testing (~1ms/10000 tasks). Any input from you or anyone on what the typical task count distribution is that we are addressing? I also made a rough prototype, and it appears to be simpler with fewer lines of code because I do not need to handle preemption. It just happens naturally.
First of all, we can have a per-task counter that tracks how many hazard pointers are active. If this is zero, then we can simply skip the task instead of wasting cycles scanning all the task slot. Further, we can have a retire list that reuses a single scan to scan all the objects in the retire list, thus reusing the scan cost. This can also assist in asynchronously implementing object retiring via a dedicated thread perhaps with the tasks RCU infrastructure. We can also make this per-task counter a bitmap to speed up scanning potentially.
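To make that concrete, here is a rough, untested sketch of the scan I have in mind (nr_active_hazptr, hazptr_slots[] and NR_HAZPTR_TASK_SLOTS are made-up names for illustration; they don't exist anywhere yet):

static bool hazptr_addr_used_by_any_task(void *addr)
{
	struct task_struct *g, *t;
	bool used = false;

	rcu_read_lock();
	for_each_process_thread(g, t) {
		int i;

		/* Cheap skip for tasks with no active hazard pointers. */
		if (!READ_ONCE(t->nr_active_hazptr))
			continue;
		for (i = 0; i < NR_HAZPTR_TASK_SLOTS; i++) {
			if (smp_load_acquire(&t->hazptr_slots[i].addr) == addr) {
				used = true;
				goto out;
			}
		}
	}
out:
	rcu_read_unlock();
	return used;
}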
I am okay with the concept of an overflow list, but if we keep the overflow list at the per-task level instead of the per-CPU level, it is highly unlikely IMO that such an overflow list will be used unless more than, say, eight hazard pointers per task are active at any given time. So its lock contention would be rarer than, say, having a per-CPU overflow list. I would say that contention would be incredibly rare because typically hazard pointers are used by multiple tasks, each of which will have its own unique set of slots. Whereas in a per-CPU overflow approach, we have a higher chance of lock contention, especially when the number of CPUs is low.
Other than the task-scanning performance issue, what am I missing?
Another nice benefit of using per-task hazard pointers is that we can also implement sleeping in hazard pointer sections because we will be scanning for sleeping tasks as well.
By contrast, the other approaches I have seen with per-CPU hazard pointers forbid sleeping, since after sleeping a task is no longer associated with its CPU. The other approaches also have a higher likelihood of locking due to running out of slots.
Of course I am missing a use case, but I suspect we can find a per-CPU ref-count use case that benefits from this. I am researching use cases when I get time. I think my next task is to find a solid use case for this before doing further development of a solution..
By the way, feedback on the scanning patch:
Can you consider using a per-CPU counter to track the number of active slots per CPU? That way you can ignore CPU slots for CPUs that are not using hazard pointers. Another idea is to skip idle CPUs as well.
Have you also considered any asynchronous use case where maintaining a retired list would assist in RCU-style deferred reclaim of hazard-pointer objects?
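As a strawman, the asynchronous path I'm imagining looks something like the following. Every name below (hazptr_retire(), hazptr_synchronize(), struct hazptr_retired) is made up for illustration and not from your series:

#include <linux/llist.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Assumed to exist in some form; the name is made up. */
void hazptr_synchronize(void *addr);

struct hazptr_retired {
	struct llist_node node;
	void *addr;
	void (*free_fn)(void *addr);
};

static LLIST_HEAD(hazptr_retired_list);

static void hazptr_reclaim_work_fn(struct work_struct *work)
{
	struct llist_node *batch = llist_del_all(&hazptr_retired_list);
	struct hazptr_retired *r, *tmp;

	/*
	 * Naive version: one synchronize per object. The point of the
	 * retire list is that a single scan could instead be amortized
	 * over the whole batch.
	 */
	llist_for_each_entry_safe(r, tmp, batch, node) {
		hazptr_synchronize(r->addr);
		r->free_fn(r->addr);
		kfree(r);
	}
}
static DECLARE_WORK(hazptr_reclaim_work, hazptr_reclaim_work_fn);

static void hazptr_retire(void *addr, void (*free_fn)(void *addr))
{
	struct hazptr_retired *r = kmalloc(sizeof(*r), GFP_ATOMIC);

	if (!r) {
		/* Fall back to a synchronous wait on allocation failure. */
		hazptr_synchronize(addr);
		free_fn(addr);
		return;
	}
	r->addr = addr;
	r->free_fn = free_fn;
	llist_add(&r->node, &hazptr_retired_list);
	schedule_work(&hazptr_reclaim_work);
}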
thanks,
- Joel
>
> It is based on v6.18.1.
>
> Review is very welcome,
>
> Thanks,
>
> Mathieu
>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: John Stultz <jstultz@google.com>
> Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Zqiang <qiang.zhang1211@gmail.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: maged.michael@gmail.com
> Cc: Mateusz Guzik <mjguzik@gmail.com>
> Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
> Cc: rcu@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: lkmm@lists.linux.dev
>
> Mathieu Desnoyers (4):
> compiler.h: Introduce ptr_eq() to preserve address dependency
> Documentation: RCU: Refer to ptr_eq()
> hazptr: Implement Hazard Pointers
> hazptr: Migrate per-CPU slots to backup slot on context switch
>
> Documentation/RCU/rcu_dereference.rst | 38 +++-
> include/linux/compiler.h | 63 +++++++
> include/linux/hazptr.h | 241 ++++++++++++++++++++++++++
> include/linux/sched.h | 4 +
> init/init_task.c | 3 +
> init/main.c | 2 +
> kernel/Kconfig.preempt | 10 ++
> kernel/Makefile | 2 +-
> kernel/fork.c | 3 +
> kernel/hazptr.c | 150 ++++++++++++++++
> kernel/sched/core.c | 2 +
> 11 files changed, 512 insertions(+), 6 deletions(-)
> create mode 100644 include/linux/hazptr.h
> create mode 100644 kernel/hazptr.c
>
> --
> 2.39.5
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH v4 0/4] Hazard Pointers
2025-12-18 10:33 ` [RFC PATCH v4 0/4] Hazard Pointers Joel Fernandes
@ 2025-12-18 17:54 ` Mathieu Desnoyers
0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-12-18 17:54 UTC (permalink / raw)
To: Joel Fernandes
Cc: Boqun Feng, Paul E. McKenney, linux-kernel, Nicholas Piggin,
Michael Ellerman, Greg Kroah-Hartman, Sebastian Andrzej Siewior,
Will Deacon, Peter Zijlstra, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 2025-12-18 05:33, Joel Fernandes wrote:
> Hi Mathieu,
> Thanks for posting this.
>
>> On Dec 17, 2025, at 8:45 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>> Hi,
>>
>> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
>> if you guys have time to try it out with your use-cases it would be
>> great!
>>
>> This new version does the following:
>>
>> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
>> - The hazard pointer user allocates a hazard pointer context variable
>> (typically on the stack), which contains the pointer to the slot *and*
>> a backup slot,
>> - When all the per-CPU slots are in use, fallback to the backup slot.
>> Chain the backup slot into per-CPU lists, each protected by a raw
>> spinlock.
>> - The hazard pointer synchronize does a piecewise iteration on the
>> per-CPU overflow slots lists, releasing the raw spinlock between
>> each list item. It uses a 64-bit generation counter to check for
>> concurrent list changes, and restart the traversal on generation
>> counter mismatch.
>> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
>> the hazard pointer acquire/release adds and then removes the hazard
>> pointer context from a per-task linked list. On context switch, the
>> scheduler migrates the per-CPU slots used by the task to the backup
>> per-context slots, thus making sure the per-CPU slots are not used
>> by preempted and blocked tasks.
>
> This last point is another reason why I want the slots to be per task instead of per CPU. It becomes very natural because the hazard pointer is always associated with a task only anyway, not with the CPU (at usecase level). By putting the slot in the task struct, we allow these requirements to flow naturally without requiring any locking or list management.. Did I miss something about the use cases?
I start from the hypothesis that scanning all tasks at synchronize
is likely too slow for a general purpose synchronization algo.
This is why we have RCU (general purpose) tracking hierarchical per-cpu
state, and specialized RCU flavors (for tracing and other use-cases)
using iteration on all tasks.
One of the highlights of hazard pointers over RCU is its ability to
know about quiescence of a pointer without waiting for a whole grace
period. Therefore I went down the route of scanning per-CPU state rather
than per-task.
>
> I did some measurements about the task-scanning issue, and it is fast in my testing (~1ms/10000 tasks). Any input from you or anyone on what the typical task count distribution is that we are addressing? I also made a rough prototype, and it appears to be simpler with fewer lines of code because I do not need to handle preemption. It just happens naturally.
It really depends on the access pattern here. If you want to replace
refcount decrement and test with hazard pointer synchronize, a delay of
1ms on each hazptr synchronize is a *lot*.
Of course perhaps this can be batched with a worker thread, but then
you are back to using more memory due to delayed reclaim, and thus
closer to RCU.
>
> First of all, we can have a per-task counter that tracks how many hazard pointers are active. If this is zero, then we can simply skip the task instead of wasting cycles scanning all the task slot.
> Further, we can have a retire list that reuses a single scan to scan all the objects in the retire list, thus reusing the scan cost. This can also assist in asynchronously implementing object retiring via a dedicated thread perhaps with the tasks RCU infrastructure. We can also make this per-task counter a bitmap to speed up scanning potentially.
>
> I am okay with the concept of an overflow list, but if we keep the overflow list at the per-task level instead of the per-CPU level, it is highly unlikely IMO that such an overflow list will be used unless more than, say, eight hazard pointers per task are active at any given time.
The PREEMPT_HAZPTR scheme achieves this as well, with per-CPU scan
rather than per-task.
> So its lock contention would be rarer than, say, having a per-CPU overflow list. I would say that contention would be incredibly rare because typically hazard pointers are used by multiple tasks, each of which will have its own unique set of slots. Whereas in a per-CPU overflow approach, we have a higher chance of lock contention, especially when the number of CPUs is low.
Contention on the per-CPU spinlocks would only happen if threads are
migrated between chaining of their backup slot (either if 8 hazptr
are in use by the thread, or on context switch), and release. Do
you expect this type of migration to happen so often as to increase
contention ?
>
> Other than the task-scanning performance issue, what am I missing?
Task-scan performance is the key issue. Also you have to set a limit
of per-task slots. How would you handle the scenario where a user
needs more slots than the limit ?
>
> Another nice benefit of using per-task hazard pointers is that we can also implement sleeping in hazard pointer sections because we will be scanning for sleeping tasks as well.
The PREEMPT_HAZPTR scheme takes care of this as well. Sleeping with a
hazptr held will trigger the context switch, and the scheduler will
simply move the hazard pointer from the per-CPU slot to the backup
slot. End of story.
>
> By contrast, the other approaches I have seen with per-CPU hazard pointers forbid sleeping, since after sleeping a task is no longer associated with its CPU. The other approaches also have a higher likelihood of locking Due to running out of slots.
Except for PREEMPT_HAZPTR. :)
>
> Of course I am missing a use case, but I suspect we can find a per-CPU ref-count use case that benefits from this. I am researching use cases when I get time. I think my next task is to find a solid use case for this
before doing further development of a solution..
Let me know what you find, I'm very interested!
>
> By the way, feedback on the scanning patch:
>
> Can you consider using a per-CPU counter to track the number of active slots per CPU? That way you can ignore CPU slots for CPUs that are not using hazard pointers. Another idea is to skip idle CPUs as well.
I had this in a userspace prototype: a counter of used per-cpu slots.
But I finally decided against it because it requires extra instructions
on the fast path, *and* scanning 8 pointers within a single cache line
on synchronize is really cheap. So I'd need benchmarks to justify adding
back the counter.
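For reference, the per-CPU scan I have in mind is roughly the following (the
per-CPU variable name and the slots[] field are written from memory as an
illustration, not copied from patch 3/4):

static bool hazptr_addr_in_percpu_slots(int cpu, void *addr)
{
	struct hazptr_percpu_slots *s = per_cpu_ptr(&hazptr_percpu_slots, cpu);
	int i;

	/* All NR_HAZPTR_PERCPU_SLOTS (8) slots share one cache line. */
	for (i = 0; i < NR_HAZPTR_PERCPU_SLOTS; i++) {
		if (smp_load_acquire(&s->slots[i].addr) == addr)
			return true;
	}
	return false;
}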
>
> Have you also considered any asynchronous use case where maintaining a retired list would assist in RCU-style deferred reclaim of hazard-pointer objects?
This would be a logical next step indeed. I'll have to think
about this some more.
Thanks,
Mathieu
>
> thanks,
>
> - Joel
>
>>
>> It is based on v6.18.1.
>>
>> Review is very welcome,
>>
>> Thanks,
>>
>> Mathieu
>>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> Cc: "Paul E. McKenney" <paulmck@kernel.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Boqun Feng <boqun.feng@gmail.com>
>> Cc: Alan Stern <stern@rowland.harvard.edu>
>> Cc: John Stultz <jstultz@google.com>
>> Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Boqun Feng <boqun.feng@gmail.com>
>> Cc: Frederic Weisbecker <frederic@kernel.org>
>> Cc: Joel Fernandes <joel@joelfernandes.org>
>> Cc: Josh Triplett <josh@joshtriplett.org>
>> Cc: Uladzislau Rezki <urezki@gmail.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
>> Cc: Zqiang <qiang.zhang1211@gmail.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Waiman Long <longman@redhat.com>
>> Cc: Mark Rutland <mark.rutland@arm.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: maged.michael@gmail.com
>> Cc: Mateusz Guzik <mjguzik@gmail.com>
>> Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
>> Cc: rcu@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> Cc: lkmm@lists.linux.dev
>>
>> Mathieu Desnoyers (4):
>> compiler.h: Introduce ptr_eq() to preserve address dependency
>> Documentation: RCU: Refer to ptr_eq()
>> hazptr: Implement Hazard Pointers
>> hazptr: Migrate per-CPU slots to backup slot on context switch
>>
>> Documentation/RCU/rcu_dereference.rst | 38 +++-
>> include/linux/compiler.h | 63 +++++++
>> include/linux/hazptr.h | 241 ++++++++++++++++++++++++++
>> include/linux/sched.h | 4 +
>> init/init_task.c | 3 +
>> init/main.c | 2 +
>> kernel/Kconfig.preempt | 10 ++
>> kernel/Makefile | 2 +-
>> kernel/fork.c | 3 +
>> kernel/hazptr.c | 150 ++++++++++++++++
>> kernel/sched/core.c | 2 +
>> 11 files changed, 512 insertions(+), 6 deletions(-)
>> create mode 100644 include/linux/hazptr.h
>> create mode 100644 kernel/hazptr.c
>>
>> --
>> 2.39.5
>>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread