From: Joel Fernandes <joelagnelf@nvidia.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Boqun Feng <boqun.feng@gmail.com>,
	Joel Fernandes <joel@joelfernandes.org>,
	"Paul E. McKenney" <paulmck@kernel.org>
Cc: linux-kernel@vger.kernel.org, Nicholas Piggin <npiggin@gmail.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Will Deacon <will@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Alan Stern <stern@rowland.harvard.edu>,
	John Stultz <jstultz@google.com>,
	Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Zqiang <qiang.zhang1211@gmail.com>,
	Ingo Molnar <mingo@redhat.com>, Waiman Long <longman@redhat.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Vlastimil Babka <vbabka@suse.cz>,
	maged.michael@gmail.com, Mateusz Guzik <mjguzik@gmail.com>,
	Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>,
	rcu@vger.kernel.org, linux-mm@kvack.org, lkmm@lists.linux.dev
Subject: Re: [RFC PATCH v4 3/4] hazptr: Implement Hazard Pointers
Date: Thu, 18 Dec 2025 20:22:04 -0500
Message-ID: <6cbba525-3bcd-4673-a6a5-1fe16330868f@nvidia.com>
In-Reply-To: <20251218014531.3793471-4-mathieu.desnoyers@efficios.com>

On 12/18/2025 12:54 PM, Mathieu Desnoyers wrote:
[..]
>>> <mathieu.desnoyers@efficios.com> wrote:
[..]
>>> Here is a revisited version of my Hazard Pointers series. Boqun, Joel,
>>> if you guys have time to try it out with your use-cases it would be
>>> great!
>>>
>>> This new version does the following:
>>>
>>> - It has 8 preallocated hazard pointer slots per CPU (one cache line),
>>> - The hazard pointer user allocates a hazard pointer context variable
>>>   (typically on the stack), which contains the pointer to the slot *and*
>>>   a backup slot,
>>> - When all the per-CPU slots are in use, fall back to the backup slot.
>>>   Chain the backup slot into per-CPU lists, each protected by a raw
>>>   spinlock.
>>> - The hazard pointer synchronize does a piecewise iteration on the
>>>   per-CPU overflow slot lists, releasing the raw spinlock between
>>>   each list item. It uses a 64-bit generation counter to check for
>>>   concurrent list changes, and restarts the traversal on generation
>>>   counter mismatch.
>>> - There is a new CONFIG_PREEMPT_HAZPTR config option. When enabled,
>>>   the hazard pointer acquire/release adds and then removes the hazard
>>>   pointer context from a per-task linked list. On context switch, the
>>>   scheduler migrates the per-CPU slots used by the task to the backup
>>>   per-context slots, thus making sure the per-CPU slots are not used
>>>   by preempted and blocked tasks.
>>
>> This last point is another reason why I want the slots to be per task instead
>> of per CPU. It becomes very natural because the hazard pointer is always
>> associated with a task only anyway, not with the CPU (at usecase level). By
>> putting the slot in the task struct, we allow these requirements to flow
>> naturally without requiring any locking or list management. Did I miss
>> something about the use cases?
>
> I start from the hypothesis that scanning all tasks at synchronize
> is likely too slow for a general purpose synchronization algo.
>
> This is why we have RCU (general purpose) tracking hierarchical per-cpu
> state, and specialized RCU flavors (for tracing and other use-cases)
> using iteration on all tasks.
>

I wouldn't personally call HP "general"; it could be (should be?) use-case
specific. That's why I am not opposed to the per-CPU approach, but I am not
convinced there is no room for a per-task slot approach, which is simpler in
many respects in the code.

If a use case has thousands of CPUs but few tasks running, then per-task would
be way faster. It would be even cheaper than per-CPU slots on the read side
(see the sketch below):
1. Overflow is relatively unlikely, since plenty of slots are distributed
   across tasks, thus avoiding locking.
2. Preemption also doesn't require any saving/restoring or locking.
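
To make this concrete, here is a rough sketch of what I have in mind (all
names hypothetical, kernel-style sketch, not the posted series):

/* Hypothetical: each task owns a small slot array in task_struct. */
#define HAZPTR_PER_TASK_SLOTS 8

struct task_hazptr {
        void *slots[HAZPTR_PER_TASK_SLOTS];     /* NULL means free */
        unsigned int nr_active;         /* lets the scan skip idle tasks */
};

/* Read-side acquire: no locks, the slots belong to 'current' alone. */
static inline void **task_hazptr_acquire(struct task_hazptr *th, void *ptr)
{
        int i;

        for (i = 0; i < HAZPTR_PER_TASK_SLOTS; i++) {
                if (!th->slots[i]) {
                        WRITE_ONCE(th->slots[i], ptr);
                        /* Publish the slot before the caller re-validates
                         * that ptr is still reachable. */
                        smp_mb();
                        th->nr_active++;
                        return &th->slots[i];
                }
        }
        return NULL;    /* caller would fall back to an overflow node */
}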

I would look at per-task HP as a more specific HP implementation, if you will,
and per-CPU as leaning more towards "general". There are tradeoffs, of course.

> One of the highlights of hazard pointers over RCU is its ability to
> know about quiescence of a pointer without waiting for a whole grace
> period. Therefore I went down the route of scanning per-CPU state rather
> than per-task.

Sure, but there could be room for both flavors, no?

>> I did some measurements about the task-scanning issue, and it is fast in my
>> testing (~1ms/10000 tasks). Any input from you or anyone on what the typical
>> task count distribution is that we are addressing? I also made a rough
>> prototype, and it appears to be simpler with fewer lines of code because I do
>> not need to handle preemption. It just happens naturally.
>
> It really depends on the access pattern here. If you want to replace
> refcount decrement and test with hazard pointer synchronize, a delay of
> 1ms on each hazptr synchronize is a *lot*.

Yeah, true, but I don't think *that* is a fair comparison. The slow path is
always going to be slower than an atomic refcount no matter which HP approach
we talk about.

I think percpu_ref_kill(), which triggers a GP, is a better comparison:

  percpu_ref_kill()
    - percpu_ref_kill_and_confirm()
       ...
       - call_rcu_hurry(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu)
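
For instance (simplified sketch; "my_obj" is made up), the release callback
only runs after the GP-driven switch to atomic mode lets the count hit zero:

static void my_obj_release(struct percpu_ref *ref)
{
        struct my_obj *obj = container_of(ref, struct my_obj, ref);

        kfree(obj);     /* only reached after the RCU-driven atomic switch */
}

/* Setup: percpu_ref_init(&obj->ref, my_obj_release, 0, GFP_KERNEL); */

/* Teardown: queues call_rcu_hurry() internally, so the release above is
 * delayed by at least a grace period: */
percpu_ref_kill(&obj->ref);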

> Of course perhaps this can be batched with a worker thread, but then
> you are back to using more memory due to delayed reclaim, and thus
> closer to RCU.

If you have millions of objects with per-CPU refcounts, that can go into the
megabytes and beyond: each percpu_ref carries roughly one unsigned long per
CPU, so a million objects on a 64-CPU box is already hundreds of megabytes of
counters. With HP, you can negate that. So even with a batched worker thread,
there can be space savings and a fast read side (at the expense of slightly
more overhead on the write side). Sure, the lockdep use case does not benefit,
since Boqun wants to speed up the write side in that case
(synchronize_rcu_expedited()), but we probably don't want to discount other
potential HP use cases that want a faster, simpler read side while having
preemptibility, small structures, etc., and don't care about update-side
performance.

>> First of all, we can have a per-task counter that tracks how many hazard
>> pointers are active. If this is zero, then we can simply skip the task
>> instead of wasting cycles scanning all the task's slots.
>> Further, we can have a retire list so that a single scan covers all the
>> objects in the retire list, thus amortizing the scan cost. This can also
>> assist in asynchronously retiring objects via a dedicated thread, perhaps
>> with the tasks-RCU infrastructure. We can also make this per-task counter
>> a bitmap to potentially speed up scanning.
>>
>> I am okay with the concept of an overflow list, but if we keep the overflow
>> list at the per-task level instead of the per-CPU level, it is highly unlikely
>> IMO that such an overflow list will be used unless more than, say, eight
>> hazard pointers per task are active at any given time.
>
> The PREEMPT_HAZPTR scheme achieves this as well, with per-CPU scan
> rather than per-task.

Sure, but with additional complexity and locking that the per-task slot
approach does not have at all.
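
Roughly, the scan side I am imagining (hypothetical sketch, building on the
per-task slot array sketched earlier; t->hazptr is a made-up task_struct
field):

/* Returns true if any task still publishes ptr in one of its slots. */
static bool hazptr_ptr_in_use(void *ptr)
{
        struct task_struct *g, *t;
        int i;

        rcu_read_lock();
        for_each_process_thread(g, t) {
                if (!READ_ONCE(t->hazptr.nr_active))
                        continue;       /* fast skip: no active hazptrs */
                for (i = 0; i < HAZPTR_PER_TASK_SLOTS; i++) {
                        if (READ_ONCE(t->hazptr.slots[i]) == ptr) {
                                rcu_read_unlock();
                                return true;
                        }
                }
        }
        rcu_read_unlock();
        return false;
}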

>
>> So its lock contention would be rarer than, say, having a per-CPU overflow
>> list. I would say that contention would be incredibly rare because typically
>> hazard pointers are used by multiple tasks, each of which will have its own
>> unique set of slots. Whereas in a per-CPU overflow approach, we have a higher
>> chance of lock contention, especially when the number of CPUs is low.
>
> Contention on the per-CPU spinlocks would only happen if threads are
> migrated between chaining of their backup slot (either if 8 hazptr
> are in use by the thread, or on context switch), and release. Do
> you expect this type of migration to happen so often as to increase
> contention ?

Is that really true? You also have contention when you exhaust all the per-CPU
slots, not just on migration:

+			/*
+			 * If all the per-CPU slots are already in use, fallback
+			 * to the backup slot.
+			 */
+			if (unlikely(!slot))
+				slot = hazptr_chain_backup_slot(ctx);

>
>>
>> Other than the task-scanning performance issue, what am I missing?
>
> Task-scan performance is the key issue. Also you have to set a limit
> of per-task slots. How would you handle the scenario where a user
> needs more slots than the limit ?

Sure, there is a limit, but I wager that it is harder to hit than in the
per-CPU case. With a small number of CPUs, you might be hitting the per-CPU
limit all the time, and billions of Linux systems have 8-16 CPUs. I'd add an
overflow node in task_struct to handle running out of per-task slots.

>> Another nice benefit of using per-task hazard pointers is that we can also
>> implement sleeping in hazard pointer sections because we will be scanning for
>> sleeping tasks as well.
>
> The PREEMPT_HAZPTR scheme takes care of this as well. Sleeping with a
> hazptr held will trigger the context switch, and the scheduler will
> simply move the hazard pointer from the per-CPU slot to the backup
> slot. End of story.

Yeah, but the per-task approach does not need that at all. In fact, the hazard
pointer context structure is smaller.

>> By contrast, the other approaches I have seen with per-CPU hazard pointers
>> forbid sleeping, since after sleeping a task is no longer associated with its
>> CPU. The other approaches also have a higher likelihood of locking due to
>> running out of slots.
>
> Except for PREEMPT_HAZPTR. :)

Ah, you mean because you make a copy of the slots into the task. But why not
just keep them in the task to begin with? :-D (Your point about task-scan
latency is valid, though.)

>> Of course, I am missing a use case, but I suspect we can find a per-CPU
>> refcount use case that benefits from this. I am researching use cases as I
>> get time. I think my next task is to find a solid use case for this before
>> doing further development of a solution.
>
> Let me know what you find, I'm very interested!

Sure :) will do, thanks.

>> By the way, feedback on the scanning patch:
>>
>> Can you consider using a per-CPU counter to track the number of active slots
>> per CPU? That way you can ignore CPU slots for CPUs that are not using hazard
>> pointers. Another idea is to skip idle CPUs as well.
>
> I had this in a userspace prototype: a counter of used per-cpu slots.
> But I finally decided against it because it requires extra instructions
> on the fast path, *and* scanning 8 pointers within a single cache line
> on synchronize is really cheap. So I'd need benchmarks to justify adding
> back the counter.

Interesting, makes sense.

>> Have you also considered any asynchronous use case where maintaining a
>> retired list would assist in RCU-style deferred reclaim of hazard-pointer
>> objects?
>
> This would be a logical next step indeed. I'll have to think
> about this some more.
>

You might be able to reuse some of the RCU async processing machinery for
that, or maybe use timers/workqueues.
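
Something like this, for example (hypothetical sketch, reusing the
hazptr_ptr_in_use() helper from the earlier sketch):

struct hazptr_retire_node {
        struct llist_node node;
        void *ptr;
        void (*free_fn)(void *ptr);
};

static LLIST_HEAD(hazptr_retire_list);

static void hazptr_retire_work_fn(struct work_struct *work)
{
        struct llist_node *batch = llist_del_all(&hazptr_retire_list);
        struct hazptr_retire_node *rn, *tmp;

        llist_for_each_entry_safe(rn, tmp, batch, node) {
                if (hazptr_ptr_in_use(rn->ptr)) {
                        /* Still published somewhere; retry on a later pass. */
                        llist_add(&rn->node, &hazptr_retire_list);
                        continue;
                }
                rn->free_fn(rn->ptr);
                kfree(rn);
        }
        /* A real implementation would amortize one task scan over the whole
         * batch instead of re-checking per object, as discussed above. */
}
static DECLARE_WORK(hazptr_retire_work, hazptr_retire_work_fn);

Callers would llist_add() a retire node and then queue_work() the above.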

Thanks.



