[Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs

ksummit.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

* [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
@ 2014-05-07 18:29 Will Deacon
  2014-05-07 19:12 ` Peter Zijlstra
  2014-05-07 21:17 ` Waiman Long
  0 siblings, 2 replies; 12+ messages in thread
From: Will Deacon @ 2014-05-07 18:29 UTC (permalink / raw)
  To: ksummit-discuss; +Cc: peterz, waiman.long

Hi all,

Traditionally, low-level synchronisation and atomic constructs have been
buried away in arch-specific code, with new arch maintainers having to
wrestle with Documentation/{memory-barriers,atomic_ops}.txt to ensure
they provide the (somewhat arbitrary) semantics expected by the kernel.

However, over the past year, there have been some notable events in this
area:

  (1) The addition of smp_load_acquire/smp_store_release across all
      architectures (including asm-generic)
      https://lwn.net/Articles/576486/

  (2) Merging of generic MCS spinlocks into kernel/locking, built using
      the macros introduced by (1). There are other similar patches for
      queued spinlocks and rwlocks, but they're not completely generic
      afaict.
      http://lwn.net/Articles/590243/

  (3) C11 atomics seem to be more prominent on people's radar.
      http://lwn.net/Articles/586838/

In light of these developments, I'd like to revisit what we expect from
architectures in the way of atomics and locks. Ultimately, this means
trying to construct efficient (not just functional!) spinlock, rwlock,
futex and bitop implementations in asm-generic which would be build out
of suitable, simple(r) atomic operations implemented by each architecture.

For example, a dumb spinlock can be constructed with our [cmp]xchg:

  dumb_spin_lock(atomic_t *lock)
  {
	while (atomic_cmpxchg(lock, 0, 1))
		cpu_relax();
  }

  dumb_spin_unlock(atomic_t *lock)
  {
	atomic_xchg(lock, 0);
  }

but nobody is going to flock to such an implementation. It's unfair and
has additional barrier overhead on weakly-ordered architectures.

What we'd like to write is something like (just trying to show the atomics
-- this is likely not endian-clean at least):

#define TICKET_SHIFT	16
#define TICKET_MASK	((1 << TICKET_SHIFT) - 1)

  better_spin_lock(atomic_t *lock)
  {
	/*
	 * Atomic add to lock with acquire semantics, returning original
	 * value.
	 */
	int old = atomic_xchg_add(ACQUIRE, lock, 1 << TICKET_SHIFT);
	if ((old << TICKET_SHIFT) == (old & (TICKET_MASK << TICKET_SHIFT)))
		return; /* Got the lock */

	while (smp_load_acquire((u16 *)lock) != (old >> TICKET_SHIFT))
		cpu_relax();
  }

  better_spin_unlock(atomic_t *lock)
  {
	smp_store_release((u16 *)lock, atomic_read(lock) + 1);
  }

which isn't perfect, but is much better than before and made possible by
the addition of atomic_xchg_add. The remaining parts to address are things
like power saving when spinning and synchronising unlock with I/O accesses,
but they will likely remain arch-specific. We should be able to do a similar
thing for other locking constructs.

Increasing the number of atomic operations would also help us to define
a cleaner interface to them, potentially easing any adoption of C11 atomics
in future. For example, rather than saying "atomic_set() does not imply
barriers" and "atomic_xchg() requires explicit memory barriers around the
operation", we could have store/release/store+release variants of them both
(similar to C11 and OpenCL). As it stands, there are a couple of cases
where atomic_xchg is used in preference to atomic_set + appropriate barriers
(just look for uses of xchg where the return value is ignored).

Another related area to clean up is the semantics around conditional
atomics (cmpxchg, add_unless, dec_if_positive etc) and barriers on the
failure paths.

Anyway, I've proposed this as a [TECH TOPIC] because I think sitting down
for half a day with other people interested in this problem could result
in us deciding on an initial set of atomic operations from which to build
efficient, generic locking constructs. That later part can definitely
happen on LKML, but the first steps will be a lot easier face-to-face.

Key people for this would be:

  - Peter Zijlstra
  - Ben Herrenschmidt
  - Paul McKenney
  - Waiman Long
  - Arnd Bergmann

Cheers,

Will

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-07 18:29 [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs Will Deacon
@ 2014-05-07 19:12 ` Peter Zijlstra
  2014-05-07 21:20   ` Will Deacon
  2014-05-07 21:17 ` Waiman Long
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2014-05-07 19:12 UTC (permalink / raw)
  To: Will Deacon; +Cc: waiman.long, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 1352 bytes --]

On Wed, May 07, 2014 at 07:29:16PM +0100, Will Deacon wrote:
>   better_spin_lock(atomic_t *lock)
>   {
> 	/*
> 	 * Atomic add to lock with acquire semantics, returning original
> 	 * value.
> 	 */
> 	int old = atomic_xchg_add(ACQUIRE, lock, 1 << TICKET_SHIFT);
> 	if ((old << TICKET_SHIFT) == (old & (TICKET_MASK << TICKET_SHIFT)))
> 		return; /* Got the lock */
> 
> 	while (smp_load_acquire((u16 *)lock) != (old >> TICKET_SHIFT))
> 		cpu_relax();
>   }

So xchg_add and atomic_add_return() are pretty much an identity map
because the add operation is reversible. The other popular name for this
operation is fetch_add() fwiw.

In any case, something that's been brewing in the back of my mind is an
ATOMIC_OP() and ATOMIC_RET_OP() macro construct that takes a lambda
function (expr-stmt is I think the closes we get in C) and either
generates the appropriate ll/sc loop or a cmpxchg loop, depending on
arch.

Its fairly sad that many of the generic atomic operations we do with
cmpxchg loops where ll/sc archs can do better using their ll/sc
primitive and I just feel there must be a semi-sane way to express all
that, despite C :-)

</rant>

That also reminds me, I should send round 2 of the arch atomic cleanups
(I've got the patches), and write changelogs for round 3, and start on
the patches for round 4 :-)

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-07 19:12 ` Peter Zijlstra
@ 2014-05-07 21:20   ` Will Deacon
  2014-05-08  9:13     ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Will Deacon @ 2014-05-07 21:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: waiman.long, ksummit-discuss

Hi Peter,

On Wed, May 07, 2014 at 08:12:08PM +0100, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 07:29:16PM +0100, Will Deacon wrote:
> >   better_spin_lock(atomic_t *lock)
> >   {
> > 	/*
> > 	 * Atomic add to lock with acquire semantics, returning original
> > 	 * value.
> > 	 */
> > 	int old = atomic_xchg_add(ACQUIRE, lock, 1 << TICKET_SHIFT);
> > 	if ((old << TICKET_SHIFT) == (old & (TICKET_MASK << TICKET_SHIFT)))
> > 		return; /* Got the lock */
> > 
> > 	while (smp_load_acquire((u16 *)lock) != (old >> TICKET_SHIFT))
> > 		cpu_relax();
> >   }
> 
> So xchg_add and atomic_add_return() are pretty much an identity map
> because the add operation is reversible. The other popular name for this
> operation is fetch_add() fwiw.

Yup, and we could implement atomic_xchg_add (or atomic_fetch_add :) in
asm-generic using atomic_add_return followed by a subtraction. However,
architectures might be able to avoid the sub (especially on LL/SC
implementations like ARM).

> In any case, something that's been brewing in the back of my mind is an
> ATOMIC_OP() and ATOMIC_RET_OP() macro construct that takes a lambda
> function (expr-stmt is I think the closes we get in C) and either
> generates the appropriate ll/sc loop or a cmpxchg loop, depending on
> arch.

I've been thinking along the same lines but decided it was a bit too
abstract to propose here. I'd certainly be interested in talking about it
though. Another cool thing would be to allow for arbitrary compositions of
different atomic operations, then apply barrier semantics to the whole lot.
Not sure how much mileage there is in that though, especially given the
typical architectural restrictions on what you can in a LL/SC loop (and
if they get too big, you shoot yourself in the foot).

> Its fairly sad that many of the generic atomic operations we do with
> cmpxchg loops where ll/sc archs can do better using their ll/sc
> primitive and I just feel there must be a semi-sane way to express all
> that, despite C :-)

Hehe, yeah. It's like trying to implement a functional programming language
in the preprocessor...

> That also reminds me, I should send round 2 of the arch atomic cleanups
> (I've got the patches), and write changelogs for round 3, and start on
> the patches for round 4 :-)

Excellent! Please stick me on CC when you send them out.

Will

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-07 21:20   ` Will Deacon
@ 2014-05-08  9:13     ` Peter Zijlstra
  2014-05-08 14:27       ` Peter Zijlstra
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2014-05-08  9:13 UTC (permalink / raw)
  To: Will Deacon; +Cc: waiman.long, ksummit-discuss

On Wed, May 07, 2014 at 10:20:01PM +0100, Will Deacon wrote:
> > In any case, something that's been brewing in the back of my mind is an
> > ATOMIC_OP() and ATOMIC_RET_OP() macro construct that takes a lambda
> > function (expr-stmt is I think the closes we get in C) and either
> > generates the appropriate ll/sc loop or a cmpxchg loop, depending on
> > arch.
> 
> I've been thinking along the same lines but decided it was a bit too
> abstract to propose here. I'd certainly be interested in talking about it
> though. Another cool thing would be to allow for arbitrary compositions of
> different atomic operations, then apply barrier semantics to the whole lot.
> Not sure how much mileage there is in that though, especially given the
> typical architectural restrictions on what you can in a LL/SC loop (and
> if they get too big, you shoot yourself in the foot).

OK, so I was bored in a waiting room..

So I've not yet had a look at all the arch ll/sc loop restrictions, for
some I'm sure the below will not work, but I'm hoping that for some
others it at least has a chance.

(also, waiting rooms suck..)

More or less Pseudo C, the ATOMIC things should be proper macros but I
was too lazy to do all the \ muck.

---

#ifndef load_exclusive
#define load_exclusive(ptr) ACCESS_ONCE(*ptr)
#endif

#ifndef	cmpxchg_relaxed
#define cmpxchg_relaxed	cmpxchg
#endif

/*
 * The 'stmt' statements below must include a statement of the form:
 *   __new == f(__val);
 * which computes the new value from the current/old value.
 *
 * The __ret argument should be either __new or __val, to return the new or old
 * value resp.
 */

#ifdef HAS_LL_SC

ATOMIC(ptr, stmt)
do {
	typeof(*ptr) __new, __val;

	do {
		__val = load_locked(ptr);
		stmt;
	} while (!store_conditional(ptr, __new));
} while (0)


ATOMIC_RET(ptr, __ret, stmt)
({
	typeof(*ptr) __new, __val;

	smp_mb__before_llsc();

	do {
		__val = load_locked(ptr);
		stmt;
	} while (!store_conditional(ptr, __new));

	smp_mb__after_llsc();

	__ret;
})

#else

ATOMIC(ptr, stmt)
do {
	typeof(*ptr) __old, __new, __val;

	__val = load_exclusive(ptr);
	for (;;) {
		stmt;
		__old = cmpxchg_relaxed(ptr, __val, __new);
		if (__old == __val)
			break;
		__val = __old;
	}
} while (0)

ATOMIC(ptr, __ret, stmt)
({
	typeof(*ptr) __old, __new, __val;

	__val = load_exclusive(ptr);
	for (;;) {
		stmt;
		__old = cmpxchg(ptr, __val, __new);
		if (__old == __val)
			break;
		__val = __old;
	}

	__ret;
})

#endif


static inline int atomic_add_unless(atomic_t *v, int a, int u)
{
	return ATOMIC_RET(&v->counter, __old,
		if (unlikely(__val == u))
			break;
		__new = __val + a;
	);
}


And this also raises your other point, what barrier does/should
add_unless() imply in the failure case. The cmpxchg() variant won't in
fact guarantee any barrier, while the ll/sc one depends on the arch.

Also, maybe, we should take this discussion elsewhere.. :-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-08  9:13     ` Peter Zijlstra
@ 2014-05-08 14:27       ` Peter Zijlstra
  2014-05-08 14:43         ` David Woodhouse
  2014-05-08 15:13         ` Will Deacon
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2014-05-08 14:27 UTC (permalink / raw)
  To: Will Deacon; +Cc: waiman.long, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 1410 bytes --]

On Thu, May 08, 2014 at 11:13:12AM +0200, Peter Zijlstra wrote:
> ATOMIC_RET(ptr, __ret, stmt)
> ({
> 	typeof(*ptr) __new, __val;
> 
> 	smp_mb__before_llsc();
> 
> 	do {
> 		__val = load_locked(ptr);
> 		stmt;
> 	} while (!store_conditional(ptr, __new));
> 
> 	smp_mb__after_llsc();
> 
> 	__ret;
> })

So the most common constraint (which you've confirmed is true for ARM as
well) is that we should not have memory accesses in between an LL/SC.

Making sure GCC doesn't do any is tricky, the best I can come up with is
tagging all variables with the register qualifier, like:

ATOMIC_RET(ptr, __ret, stmt)
({
	register typeof(*ptr) __new, __val;

	smp_mb__before_llsc();

	do {
		__val = load_locked(ptr);
		stmt;
	} while (!store_conditional(ptr, __new));

	smp_mb__after_llsc();

	__ret;
})

Now, I'm not at all sure if register still means anything to GCC, but in
the faint hope that it still sees it as a hint this might just work.

> static inline int atomic_add_unless(atomic_t *v, int a, int u)
> {
> 	return ATOMIC_RET(&v->counter, __old,
> 		if (unlikely(__val == u))
> 			break;
> 		__new = __val + a;
> 	);
> }

And that would then become:

static inline
int atomic_add_unless(register atomic_t *v, register int a, register int u)
{
	return ATOMIC_RET(&v->counter, __val,
		if (unlikely(__val == u))
			break;
		__new = __val + a;
	);
}

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-08 14:27       ` Peter Zijlstra
@ 2014-05-08 14:43         ` David Woodhouse
  2014-05-08 15:13         ` Will Deacon
  1 sibling, 0 replies; 12+ messages in thread
From: David Woodhouse @ 2014-05-08 14:43 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: waiman.long, ksummit-discuss

[-- Attachment #1: Type: text/plain, Size: 1244 bytes --]

On Thu, 2014-05-08 at 16:27 +0200, Peter Zijlstra wrote:
> 
> So the most common constraint (which you've confirmed is true for ARM as
> well) is that we should not have memory accesses in between an LL/SC.
> 
> Making sure GCC doesn't do any is tricky, the best I can come up with is
> tagging all variables with the register qualifier, like:
> 
> ATOMIC_RET(ptr, __ret, stmt)
> ({
>         register typeof(*ptr) __new, __val;
> 
>         smp_mb__before_llsc();
> 
>         do {
>                 __val = load_locked(ptr);
>                 stmt;
>         } while (!store_conditional(ptr, __new));
> 
>         smp_mb__after_llsc();
> 
>         __ret;
> })
> 
> Now, I'm not at all sure if register still means anything to GCC, but in
> the faint hope that it still sees it as a hint this might just work.

No. Please no. Not even if you sacrifice a goat to the gods of GCC
undefined behaviour.

Acting on condition flags from inline asm is also hard or suboptimal in
some cases too. You're probably better off just having this done in asm
directly.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5745 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-08 14:27       ` Peter Zijlstra
  2014-05-08 14:43         ` David Woodhouse
@ 2014-05-08 15:13         ` Will Deacon
  2014-05-08 16:39           ` Paul E. McKenney
  1 sibling, 1 reply; 12+ messages in thread
From: Will Deacon @ 2014-05-08 15:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: waiman.long, ksummit-discuss

On Thu, May 08, 2014 at 03:27:34PM +0100, Peter Zijlstra wrote:
> On Thu, May 08, 2014 at 11:13:12AM +0200, Peter Zijlstra wrote:
> > ATOMIC_RET(ptr, __ret, stmt)
> > ({
> > 	typeof(*ptr) __new, __val;
> > 
> > 	smp_mb__before_llsc();
> > 
> > 	do {
> > 		__val = load_locked(ptr);
> > 		stmt;
> > 	} while (!store_conditional(ptr, __new));
> > 
> > 	smp_mb__after_llsc();
> > 
> > 	__ret;
> > })
> 
> So the most common constraint (which you've confirmed is true for ARM as
> well) is that we should not have memory accesses in between an LL/SC.

Yup.

> Making sure GCC doesn't do any is tricky, the best I can come up with is
> tagging all variables with the register qualifier, like:
> 
> ATOMIC_RET(ptr, __ret, stmt)
> ({
> 	register typeof(*ptr) __new, __val;
> 
> 	smp_mb__before_llsc();
> 
> 	do {
> 		__val = load_locked(ptr);
> 		stmt;
> 	} while (!store_conditional(ptr, __new));
> 
> 	smp_mb__after_llsc();
> 
> 	__ret;
> })
> 
> Now, I'm not at all sure if register still means anything to GCC, but in
> the faint hope that it still sees it as a hint this might just work.

I just ran this past our compiler guys and they threw rocks at me. Even if
it happens to work today, I don't think we can rely on it tomorrow,
unfortunately.

I think that makes the case for extended the fine-grained atomics
implemented by each architecture, but it would still be great to have a way
to compose them.

Will

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-08 15:13         ` Will Deacon
@ 2014-05-08 16:39           ` Paul E. McKenney
  0 siblings, 0 replies; 12+ messages in thread
From: Paul E. McKenney @ 2014-05-08 16:39 UTC (permalink / raw)
  To: Will Deacon; +Cc: Peter Zijlstra, waiman.long, ksummit-discuss

On Thu, May 08, 2014 at 04:13:31PM +0100, Will Deacon wrote:
> On Thu, May 08, 2014 at 03:27:34PM +0100, Peter Zijlstra wrote:
> > On Thu, May 08, 2014 at 11:13:12AM +0200, Peter Zijlstra wrote:
> > > ATOMIC_RET(ptr, __ret, stmt)
> > > ({
> > > 	typeof(*ptr) __new, __val;
> > > 
> > > 	smp_mb__before_llsc();
> > > 
> > > 	do {
> > > 		__val = load_locked(ptr);
> > > 		stmt;
> > > 	} while (!store_conditional(ptr, __new));
> > > 
> > > 	smp_mb__after_llsc();
> > > 
> > > 	__ret;
> > > })
> > 
> > So the most common constraint (which you've confirmed is true for ARM as
> > well) is that we should not have memory accesses in between an LL/SC.
> 
> Yup.
> 
> > Making sure GCC doesn't do any is tricky, the best I can come up with is
> > tagging all variables with the register qualifier, like:
> > 
> > ATOMIC_RET(ptr, __ret, stmt)
> > ({
> > 	register typeof(*ptr) __new, __val;
> > 
> > 	smp_mb__before_llsc();
> > 
> > 	do {
> > 		__val = load_locked(ptr);
> > 		stmt;
> > 	} while (!store_conditional(ptr, __new));
> > 
> > 	smp_mb__after_llsc();
> > 
> > 	__ret;
> > })
> > 
> > Now, I'm not at all sure if register still means anything to GCC, but in
> > the faint hope that it still sees it as a hint this might just work.
> 
> I just ran this past our compiler guys and they threw rocks at me. Even if
> it happens to work today, I don't think we can rely on it tomorrow,
> unfortunately.
> 
> I think that makes the case for extended the fine-grained atomics
> implemented by each architecture, but it would still be great to have a way
> to compose them.

I suppose that we could attempt to apply the same structure to the asms,
but it is not clear that this would be a win.  There is some benefit to
having the assembly for each primitive laid out in one place, even if
it does mean quite a bit of duplicate code.

But yes, it sure feels like there should be some better way to do this...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-07 18:29 [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs Will Deacon
  2014-05-07 19:12 ` Peter Zijlstra
@ 2014-05-07 21:17 ` Waiman Long
  2014-05-07 21:26   ` Peter Zijlstra
  2014-05-08 14:16   ` Will Deacon
  1 sibling, 2 replies; 12+ messages in thread
From: Waiman Long @ 2014-05-07 21:17 UTC (permalink / raw)
  To: Will Deacon; +Cc: peterz, ksummit-discuss

On 05/07/2014 02:29 PM, Will Deacon wrote:
> Hi all,
>
> Traditionally, low-level synchronisation and atomic constructs have been
> buried away in arch-specific code, with new arch maintainers having to
> wrestle with Documentation/{memory-barriers,atomic_ops}.txt to ensure
> they provide the (somewhat arbitrary) semantics expected by the kernel.
>
> However, over the past year, there have been some notable events in this
> area:
>
>    (1) The addition of smp_load_acquire/smp_store_release across all
>        architectures (including asm-generic)
>        https://lwn.net/Articles/576486/
>
>    (2) Merging of generic MCS spinlocks into kernel/locking, built using
>        the macros introduced by (1). There are other similar patches for
>        queued spinlocks and rwlocks, but they're not completely generic
>        afaict.
>        http://lwn.net/Articles/590243/

It is true that the current qspinlock patch is not completely generic. 
However, I think it can still be used by most architectures with the 
exception of, perhaps just, the pre-EV56 alpha.

>    (3) C11 atomics seem to be more prominent on people's radar.
>        http://lwn.net/Articles/586838/
>
> In light of these developments, I'd like to revisit what we expect from
> architectures in the way of atomics and locks. Ultimately, this means
> trying to construct efficient (not just functional!) spinlock, rwlock,
> futex and bitop implementations in asm-generic which would be build out
> of suitable, simple(r) atomic operations implemented by each architecture.
>
> For example, a dumb spinlock can be constructed with our [cmp]xchg:
>
>    dumb_spin_lock(atomic_t *lock)
>    {
> 	while (atomic_cmpxchg(lock, 0, 1))
> 		cpu_relax();
>    }
>
>    dumb_spin_unlock(atomic_t *lock)
>    {
> 	atomic_xchg(lock, 0);
>    }
>
> but nobody is going to flock to such an implementation. It's unfair and
> has additional barrier overhead on weakly-ordered architectures.
>
> What we'd like to write is something like (just trying to show the atomics
> -- this is likely not endian-clean at least):
>
> #define TICKET_SHIFT	16
> #define TICKET_MASK	((1<<  TICKET_SHIFT) - 1)
>
>    better_spin_lock(atomic_t *lock)
>    {
> 	/*
> 	 * Atomic add to lock with acquire semantics, returning original
> 	 * value.
> 	 */
> 	int old = atomic_xchg_add(ACQUIRE, lock, 1<<  TICKET_SHIFT);
> 	if ((old<<  TICKET_SHIFT) == (old&  (TICKET_MASK<<  TICKET_SHIFT)))
> 		return; /* Got the lock */
>
> 	while (smp_load_acquire((u16 *)lock) != (old>>  TICKET_SHIFT))
> 		cpu_relax();
>    }
>
>    better_spin_unlock(atomic_t *lock)
>    {
> 	smp_store_release((u16 *)lock, atomic_read(lock) + 1);
>    }

There is a problem that the current smp_store_release() supports only an 
int or long data types. So the compilation will fail for a short or char 
data types. It will make thing easier if short and char are also 
supported, but that won't work in architectures like pre-EV56 alphas 
which has no native 8 and 16 bits load/store instructions. Perhaps we 
should think about retiring Linux support for those architectures.

> which isn't perfect, but is much better than before and made possible by
> the addition of atomic_xchg_add. The remaining parts to address are things

An atomic_xchg_add can be emulated by (atomic_add_return - addend).

> like power saving when spinning and synchronising unlock with I/O accesses,
> but they will likely remain arch-specific. We should be able to do a similar
> thing for other locking constructs.
>
> Increasing the number of atomic operations would also help us to define
> a cleaner interface to them, potentially easing any adoption of C11 atomics
> in future. For example, rather than saying "atomic_set() does not imply
> barriers" and "atomic_xchg() requires explicit memory barriers around the
> operation", we could have store/release/store+release variants of them both
> (similar to C11 and OpenCL). As it stands, there are a couple of cases
> where atomic_xchg is used in preference to atomic_set + appropriate barriers
> (just look for uses of xchg where the return value is ignored).
>
> Another related area to clean up is the semantics around conditional
> atomics (cmpxchg, add_unless, dec_if_positive etc) and barriers on the
> failure paths.

The cmpxchg() function (xchg to a lesser extent) is frequently used in 
synchronization primitives. If there are less costly instructions in 
other architectures that implement that with just the acq or rel 
semantics, we may consider adding the _acq() and _rel() versions.

-Longman

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-07 21:17 ` Waiman Long
@ 2014-05-07 21:26   ` Peter Zijlstra
  2014-05-07 22:29     ` Waiman Long
  2014-05-08 14:16   ` Will Deacon
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2014-05-07 21:26 UTC (permalink / raw)
  To: Waiman Long; +Cc: ksummit-discuss

On Wed, May 07, 2014 at 05:17:48PM -0400, Waiman Long wrote:
> On 05/07/2014 02:29 PM, Will Deacon wrote:
> >Hi all,
> >
> >Traditionally, low-level synchronisation and atomic constructs have been
> >buried away in arch-specific code, with new arch maintainers having to
> >wrestle with Documentation/{memory-barriers,atomic_ops}.txt to ensure
> >they provide the (somewhat arbitrary) semantics expected by the kernel.
> >
> >However, over the past year, there have been some notable events in this
> >area:
> >
> >   (1) The addition of smp_load_acquire/smp_store_release across all
> >       architectures (including asm-generic)
> >       https://lwn.net/Articles/576486/
> >
> >   (2) Merging of generic MCS spinlocks into kernel/locking, built using
> >       the macros introduced by (1). There are other similar patches for
> >       queued spinlocks and rwlocks, but they're not completely generic
> >       afaict.
> >       http://lwn.net/Articles/590243/
> 
> It is true that the current qspinlock patch is not completely generic.
> However, I think it can still be used by most architectures with the
> exception of, perhaps just, the pre-EV56 alpha.

Its going to be a massive pain on virt archs like PPC and s390. So while
those archs don't suffer the same problem Alpha does, they have problem
with fair locks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-07 21:26   ` Peter Zijlstra
@ 2014-05-07 22:29     ` Waiman Long
  0 siblings, 0 replies; 12+ messages in thread
From: Waiman Long @ 2014-05-07 22:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: ksummit-discuss

On 05/07/2014 05:26 PM, Peter Zijlstra wrote:
> On Wed, May 07, 2014 at 05:17:48PM -0400, Waiman Long wrote:
>> On 05/07/2014 02:29 PM, Will Deacon wrote:
>>> Hi all,
>>>
>>> Traditionally, low-level synchronisation and atomic constructs have been
>>> buried away in arch-specific code, with new arch maintainers having to
>>> wrestle with Documentation/{memory-barriers,atomic_ops}.txt to ensure
>>> they provide the (somewhat arbitrary) semantics expected by the kernel.
>>>
>>> However, over the past year, there have been some notable events in this
>>> area:
>>>
>>>    (1) The addition of smp_load_acquire/smp_store_release across all
>>>        architectures (including asm-generic)
>>>        https://lwn.net/Articles/576486/
>>>
>>>    (2) Merging of generic MCS spinlocks into kernel/locking, built using
>>>        the macros introduced by (1). There are other similar patches for
>>>        queued spinlocks and rwlocks, but they're not completely generic
>>>        afaict.
>>>        http://lwn.net/Articles/590243/
>> It is true that the current qspinlock patch is not completely generic.
>> However, I think it can still be used by most architectures with the
>> exception of, perhaps just, the pre-EV56 alpha.
> Its going to be a massive pain on virt archs like PPC and s390. So while
> those archs don't suffer the same problem Alpha does, they have problem
> with fair locks.

The qspinlock patch does support unfair lock. What an architecture needs 
to do is to define PARAVIRT_UNFAIR_LOCKS and map 
paravirt_unfairlocks_enabled to true.

-Longman

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs
  2014-05-07 21:17 ` Waiman Long
  2014-05-07 21:26   ` Peter Zijlstra
@ 2014-05-08 14:16   ` Will Deacon
  1 sibling, 0 replies; 12+ messages in thread
From: Will Deacon @ 2014-05-08 14:16 UTC (permalink / raw)
  To: Waiman Long; +Cc: peterz, ksummit-discuss

On Wed, May 07, 2014 at 10:17:48PM +0100, Waiman Long wrote:
> On 05/07/2014 02:29 PM, Will Deacon wrote:
> >    better_spin_unlock(atomic_t *lock)
> >    {
> > 	smp_store_release((u16 *)lock, atomic_read(lock) + 1);
> >    }
> 
> There is a problem that the current smp_store_release() supports only an 
> int or long data types. So the compilation will fail for a short or char 
> data types. It will make thing easier if short and char are also 
> supported, but that won't work in architectures like pre-EV56 alphas 
> which has no native 8 and 16 bits load/store instructions. Perhaps we 
> should think about retiring Linux support for those architectures.

I confess to having some Alphas from that timeframe, and it would be sad to
see them go :(

Despite that, if we could define some acquire/release accessors to work on a
bunch of native types then we could actually build atomic_t and atomic64_t
operations on top of those. So atomic_read(ACQUIRE, a) would be an
smp_load_acquire(a), where the latter can be used for u8/u16/u32/u64 too.

> > Another related area to clean up is the semantics around conditional
> > atomics (cmpxchg, add_unless, dec_if_positive etc) and barriers on the
> > failure paths.
> 
> The cmpxchg() function (xchg to a lesser extent) is frequently used in 
> synchronization primitives. If there are less costly instructions in 
> other architectures that implement that with just the acq or rel 
> semantics, we may consider adding the _acq() and _rel() versions.

I think Peter's approach of treating LL/SC and cmpxchg() as two different
ways to build other atomics probably works best here.

Will

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-05-08 16:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-07 18:29 [Ksummit-discuss] [TECH TOPIC] asm-generic implementations of low-level synchronisation constructs Will Deacon
2014-05-07 19:12 ` Peter Zijlstra
2014-05-07 21:20   ` Will Deacon
2014-05-08  9:13     ` Peter Zijlstra
2014-05-08 14:27       ` Peter Zijlstra
2014-05-08 14:43         ` David Woodhouse
2014-05-08 15:13         ` Will Deacon
2014-05-08 16:39           ` Paul E. McKenney
2014-05-07 21:17 ` Waiman Long
2014-05-07 21:26   ` Peter Zijlstra
2014-05-07 22:29     ` Waiman Long
2014-05-08 14:16   ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox