linux-mm.kvack.org archive mirror
* [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
@ 2025-02-13 19:14 Uros Bizjak
  2025-02-13 19:14 ` [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations Uros Bizjak
  2025-02-13 20:43 ` [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op() Dave Hansen
  0 siblings, 2 replies; 11+ messages in thread
From: Uros Bizjak @ 2025-02-13 19:14 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel
  Cc: Uros Bizjak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Peter Zijlstra (Intel)

percpu_{,try_}cmpxchg{64,128}() macros use CALL instruction inside
asm statement in one of their alternatives. Use ALT_OUTPUT_SP()
macro to add required dependence on %esp register.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
---
 arch/x86/include/asm/percpu.h | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index e525cd85f999..0ab991fba7de 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -350,9 +350,9 @@ do {									\
 									\
 	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
-		  : [var] "+m" (__my_cpu_var(_var)),			\
-		    "+a" (old__.low),					\
-		    "+d" (old__.high)					\
+		  : ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
+				  "+a" (old__.low),			\
+				  "+d" (old__.high))			\
 		  : "b" (new__.low),					\
 		    "c" (new__.high),					\
 		    "S" (&(_var))					\
@@ -381,10 +381,10 @@ do {									\
 	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
 			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
 		  CC_SET(z)						\
-		  : CC_OUT(z) (success),				\
-		    [var] "+m" (__my_cpu_var(_var)),			\
-		    "+a" (old__.low),					\
-		    "+d" (old__.high)					\
+		  : ALT_OUTPUT_SP(CC_OUT(z) (success),			\
+				  [var] "+m" (__my_cpu_var(_var)),	\
+				  "+a" (old__.low),			\
+				  "+d" (old__.high))			\
 		  : "b" (new__.low),					\
 		    "c" (new__.high),					\
 		    "S" (&(_var))					\
@@ -421,9 +421,9 @@ do {									\
 									\
 	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
-		  : [var] "+m" (__my_cpu_var(_var)),			\
-		    "+a" (old__.low),					\
-		    "+d" (old__.high)					\
+		  : ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
+				  "+a" (old__.low),			\
+				  "+d" (old__.high))			\
 		  : "b" (new__.low),					\
 		    "c" (new__.high),					\
 		    "S" (&(_var))					\
@@ -452,10 +452,10 @@ do {									\
 	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
 			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
 		  CC_SET(z)						\
-		  : CC_OUT(z) (success),				\
-		    [var] "+m" (__my_cpu_var(_var)),			\
-		    "+a" (old__.low),					\
-		    "+d" (old__.high)					\
+		  : ALT_OUTPUT_SP(CC_OUT(z) (success),			\
+				  [var] "+m" (__my_cpu_var(_var)),	\
+				  "+a" (old__.low),			\
+				  "+d" (old__.high))			\
 		  : "b" (new__.low),					\
 		    "c" (new__.high),					\
 		    "S" (&(_var))					\
-- 
2.42.0




* [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations
  2025-02-13 19:14 [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op() Uros Bizjak
@ 2025-02-13 19:14 ` Uros Bizjak
  2025-02-13 20:48   ` Dave Hansen
  2025-02-13 20:43 ` [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op() Dave Hansen
  1 sibling, 1 reply; 11+ messages in thread
From: Uros Bizjak @ 2025-02-13 19:14 UTC (permalink / raw)
  To: x86, linux-mm, linux-kernel
  Cc: Uros Bizjak, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Peter Zijlstra (Intel)

According to [1], the usage of asm pseudo directives in the asm template
can confuse the compiler to wrongly estimate the size of the generated
code. ALTERNATIVE macro expands to several asm pseudo directives, so
its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
to fail by an order of magnitude (the compiler estimates the length of
an asm to be more than 20 instructions). This wrong estimate further
causes suboptimal inlining decisions for functions that use these
locking primitives.

Use asm_inline instead of just asm. For inlining purposes, the size of
the asm is then taken as the minimum size, ignoring how many instructions
the compiler thinks it contains.

[1] https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html
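
For illustration only (a minimal standalone sketch of mine, not part of
the patch; the function names are made up and the dummy template merely
stands in for what ALTERNATIVE expands to):

/* x86-64; compile with gcc -O2. The "asm inline" qualifier needs GCC >= 9. */
static inline unsigned long flip(unsigned long v)
{
	/*
	 * A plain asm whose template contains extra lines (here just an
	 * assembler comment) is counted as several worst-case-length
	 * instructions, which discourages inlining of flip()'s callers.
	 */
	asm volatile("# alternatives boilerplate would go here\n\t"
		     "notq %0" : "+r" (v));
	return v;
}

static inline unsigned long flip_inline(unsigned long v)
{
	/* With asm inline, the size is taken as the minimum instead. */
	asm inline volatile("# alternatives boilerplate would go here\n\t"
			    "notq %0" : "+r" (v));
	return v;
}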

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
---
 arch/x86/include/asm/cmpxchg_32.h | 32 +++++++------
 arch/x86/include/asm/percpu.h     | 77 +++++++++++++++----------------
 2 files changed, 55 insertions(+), 54 deletions(-)

diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index fd1282a783dd..95b5f990ca88 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -91,12 +91,14 @@ static __always_inline bool __try_cmpxchg64_local(volatile u64 *ptr, u64 *oldp,
 	union __u64_halves o = { .full = (_old), },			\
 			   n = { .full = (_new), };			\
 									\
-	asm volatile(ALTERNATIVE(_lock_loc				\
-				 "call cmpxchg8b_emu",			\
-				 _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
-		     : ALT_OUTPUT_SP("+a" (o.low), "+d" (o.high))	\
-		     : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr)	\
-		     : "memory");					\
+	asm_inline volatile(						\
+		ALTERNATIVE(_lock_loc					\
+			    "call cmpxchg8b_emu",			\
+			    _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8)	\
+		: ALT_OUTPUT_SP("+a" (o.low), "+d" (o.high))		\
+		: "b" (n.low), "c" (n.high),				\
+		  [ptr] "S" (_ptr)					\
+		: "memory");						\
 									\
 	o.full;								\
 })
@@ -119,14 +121,16 @@ static __always_inline u64 arch_cmpxchg64_local(volatile u64 *ptr, u64 old, u64
 			   n = { .full = (_new), };			\
 	bool ret;							\
 									\
-	asm volatile(ALTERNATIVE(_lock_loc				\
-				 "call cmpxchg8b_emu",			\
-				 _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
-		     CC_SET(e)						\
-		     : ALT_OUTPUT_SP(CC_OUT(e) (ret),			\
-				     "+a" (o.low), "+d" (o.high))	\
-		     : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr)	\
-		     : "memory");					\
+	asm_inline volatile(						\
+		ALTERNATIVE(_lock_loc					\
+			    "call cmpxchg8b_emu",			\
+			    _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
+		CC_SET(e)						\
+		: ALT_OUTPUT_SP(CC_OUT(e) (ret),			\
+				"+a" (o.low), "+d" (o.high))		\
+		: "b" (n.low), "c" (n.high),				\
+		  [ptr] "S" (_ptr)					\
+		: "memory");						\
 									\
 	if (unlikely(!ret))						\
 		*(_oldp) = o.full;					\
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 0ab991fba7de..08f5f61690b7 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -348,15 +348,14 @@ do {									\
 	old__.var = _oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
-			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
-		  : ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
-				  "+a" (old__.low),			\
-				  "+d" (old__.high))			\
-		  : "b" (new__.low),					\
-		    "c" (new__.high),					\
-		    "S" (&(_var))					\
-		  : "memory");						\
+	asm_inline qual (						\
+		ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
+			    "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+		: ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
+				"+a" (old__.low), "+d" (old__.high))	\
+		: "b" (new__.low), "c" (new__.high),			\
+		  "S" (&(_var))						\
+		: "memory");						\
 									\
 	old__.var;							\
 })
@@ -378,17 +377,16 @@ do {									\
 	old__.var = *_oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
-			      "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
-		  CC_SET(z)						\
-		  : ALT_OUTPUT_SP(CC_OUT(z) (success),			\
-				  [var] "+m" (__my_cpu_var(_var)),	\
-				  "+a" (old__.low),			\
-				  "+d" (old__.high))			\
-		  : "b" (new__.low),					\
-		    "c" (new__.high),					\
-		    "S" (&(_var))					\
-		  : "memory");						\
+	asm_inline qual (						\
+		ALTERNATIVE("call this_cpu_cmpxchg8b_emu",		\
+			    "cmpxchg8b " __percpu_arg([var]), X86_FEATURE_CX8) \
+		CC_SET(z)						\
+		: ALT_OUTPUT_SP(CC_OUT(z) (success),			\
+				[var] "+m" (__my_cpu_var(_var)),	\
+				"+a" (old__.low), "+d" (old__.high))	\
+		: "b" (new__.low), "c" (new__.high),			\
+		  "S" (&(_var))						\
+		: "memory");						\
 	if (unlikely(!success))						\
 		*_oval = old__.var;					\
 									\
@@ -419,15 +417,14 @@ do {									\
 	old__.var = _oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
-			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
-		  : ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
-				  "+a" (old__.low),			\
-				  "+d" (old__.high))			\
-		  : "b" (new__.low),					\
-		    "c" (new__.high),					\
-		    "S" (&(_var))					\
-		  : "memory");						\
+	asm_inline qual (						\
+		ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
+			    "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+		: ALT_OUTPUT_SP([var] "+m" (__my_cpu_var(_var)),	\
+				"+a" (old__.low), "+d" (old__.high))	\
+		: "b" (new__.low), "c" (new__.high),			\
+		  "S" (&(_var))						\
+		: "memory");						\
 									\
 	old__.var;							\
 })
@@ -449,19 +446,19 @@ do {									\
 	old__.var = *_oval;						\
 	new__.var = _nval;						\
 									\
-	asm qual (ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
-			      "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
-		  CC_SET(z)						\
-		  : ALT_OUTPUT_SP(CC_OUT(z) (success),			\
-				  [var] "+m" (__my_cpu_var(_var)),	\
-				  "+a" (old__.low),			\
-				  "+d" (old__.high))			\
-		  : "b" (new__.low),					\
-		    "c" (new__.high),					\
-		    "S" (&(_var))					\
-		  : "memory");						\
+	asm_inline qual (						\
+		ALTERNATIVE("call this_cpu_cmpxchg16b_emu",		\
+			    "cmpxchg16b " __percpu_arg([var]), X86_FEATURE_CX16) \
+		CC_SET(z)						\
+		: ALT_OUTPUT_SP(CC_OUT(z) (success),			\
+				[var] "+m" (__my_cpu_var(_var)),	\
+				"+a" (old__.low), "+d" (old__.high))	\
+		: "b" (new__.low), "c" (new__.high),			\
+		  "S" (&(_var))						\
+		: "memory");						\
 	if (unlikely(!success))						\
 		*_oval = old__.var;					\
+									\
 	likely(success);						\
 })
 
-- 
2.42.0




* Re: [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
  2025-02-13 19:14 [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op() Uros Bizjak
  2025-02-13 19:14 ` [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations Uros Bizjak
@ 2025-02-13 20:43 ` Dave Hansen
  2025-02-13 21:17   ` Uros Bizjak
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2025-02-13 20:43 UTC (permalink / raw)
  To: Uros Bizjak, x86, linux-mm, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Peter Zijlstra (Intel)

On 2/13/25 11:14, Uros Bizjak wrote:
> percpu_{,try_}cmpxchg{64,128}() macros use CALL instruction inside
> asm statement in one of their alternatives. Use ALT_OUTPUT_SP()
> macro to add required dependence on %esp register.

Is this just a pedantic fix? Or is there an actual impact to end users
that needs to be considered?

Basically, you've told me what the patch does, but not why anyone should
care or why it should be applied.



* Re: [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations
  2025-02-13 19:14 ` [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations Uros Bizjak
@ 2025-02-13 20:48   ` Dave Hansen
  2025-02-13 22:13     ` Uros Bizjak
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2025-02-13 20:48 UTC (permalink / raw)
  To: Uros Bizjak, x86, linux-mm, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Dennis Zhou, Tejun Heo, Christoph Lameter,
	Peter Zijlstra (Intel)

On 2/13/25 11:14, Uros Bizjak wrote:
> According to [1], the usage of asm pseudo directives in the asm template
> can confuse the compiler to wrongly estimate the size of the generated
> code. ALTERNATIVE macro expands to several asm pseudo directives, so
> its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
> to fail by an order of magnitude (the compiler estimates the length of
> an asm to be more than 20 instructions). 

Just curious, but how did you come up with the "20 instructions" number?

> This wrong estimate further causes suboptimal inlining decisions for
> functions that use these locking primitives.
> 
> Use asm_inline instead of just asm. For inlining purposes, the size of
> the asm is then taken as the minimum size, ignoring how many instructions
> the compiler thinks it contains.

So, the compiler is trying to decide whether to inline a function or
not. The bigger it is, the less likely it is to be inlined. Since it is
over-estimating the size of {,try_}cmpxchg{64,128}, it will avoid
inlining it when it _should_ be inlining it.

Is that it?

Is any of this measurable? Is there any objective data to support that
this change is a good one?

It's quite possible that someone did the "asm" on purpose because
over-estimating the size was a good thing.



* Re: [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
  2025-02-13 20:43 ` [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op() Dave Hansen
@ 2025-02-13 21:17   ` Uros Bizjak
  2025-02-13 22:54     ` Dave Hansen
  2025-02-14 18:22     ` Christoph Lameter (Ampere)
  0 siblings, 2 replies; 11+ messages in thread
From: Uros Bizjak @ 2025-02-13 21:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-mm, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Peter Zijlstra (Intel)

On Thu, Feb 13, 2025 at 9:43 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/13/25 11:14, Uros Bizjak wrote:
> > percpu_{,try_}cmpxchg{64,128}() macros use CALL instruction inside
> > asm statement in one of their alternatives. Use ALT_OUTPUT_SP()
> > macro to add required dependence on %esp register.
>
> Is this just a pedantic fix? Or is there an actual impact to end users
> that needs to be considered?

When a CALL insn is embedded in the asm, the compiler doesn't know
that the asm now depends on the stack pointer (or frame pointer)
because of that call, so it is free to schedule the asm outside the
function frame prologue/epilogue. Currently, this only triggers an
objtool warning, but if we ever compile the kernel with a redzone
(IIRC, it was mentioned that this is possible with a FRED-enabled
kernel), the call will clobber the redzone. Please note that the
alternative_call() family of functions, __alternative_atomic64() and
__arch_{,try_}cmpxchg64_emu() all use the same macro for exactly the
reason explained above.
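
As an illustration, here is a minimal standalone sketch of that idea
(my example, not the kernel macros; this_emu_helper is a made-up
symbol): listing the stack pointer as a nominally unmodified output
operand ties the asm, and the CALL hidden inside it, to the function's
stack frame.

/* x86-64 sketch of the ASM_CALL_CONSTRAINT idea. */
register unsigned long current_stack_pointer asm("rsp");

extern void this_emu_helper(void);

static inline void emulated_op(unsigned long *p)
{
	/*
	 * "+r" (current_stack_pointer) makes the asm depend on %rsp, so
	 * the compiler cannot move it before the frame prologue or after
	 * the epilogue.
	 */
	asm volatile("call this_emu_helper"
		     : "+r" (current_stack_pointer), "+m" (*p)
		     :
		     : "memory");
}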

OTOH, all recent x86_64 processors support CMPXCHG128 insn, so the
call alternative will be rarely used.

> Basically, you've told me what the patch does, but not why anyone should
> care or why it should be applied.

This is actually explained at length in the comment for
ASM_CALL_CONSTRAINT, which ALT_OUTPUT_SP macro uses.

Uros.



* Re: [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations
  2025-02-13 20:48   ` Dave Hansen
@ 2025-02-13 22:13     ` Uros Bizjak
  2025-02-13 22:52       ` Dave Hansen
  0 siblings, 1 reply; 11+ messages in thread
From: Uros Bizjak @ 2025-02-13 22:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-mm, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Peter Zijlstra (Intel)

On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/13/25 11:14, Uros Bizjak wrote:
> > According to [1], the usage of asm pseudo directives in the asm template
> > can confuse the compiler to wrongly estimate the size of the generated
> > code. ALTERNATIVE macro expands to several asm pseudo directives, so
> > its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
> > to fail by an order of magnitude (the compiler estimates the length of
> > an asm to be more than 20 instructions).
>
> Just curious, but how did you come up with the "20 instructions" number?

Currently, a patched GCC compiler is needed (please see
asm_insn_count() and asm_str_count() functions in gcc/final.cc on how
the asm length is calculated) to report the length. For historic
reasons, the length of asm is not printed in asm dumps, but recently a
GCC PR was filed with a request to change this.
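
Roughly speaking (my approximation, not the actual gcc/final.cc code),
the estimate boils down to counting statement separators in the
template and charging each one the worst-case x86 instruction length:

#include <stddef.h>

/* Every '\n' or ';' in the asm template is assumed to start another
 * instruction, and each "instruction" is charged 16 bytes. */
static size_t approx_asm_length(const char *tmpl)
{
	size_t insns = 1;

	for (; *tmpl; tmpl++)
		if (*tmpl == '\n' || *tmpl == ';')
			insns++;
	return insns * 16;
}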

> > This wrong estimate further causes suboptimal inlining decisions for
> > functions that use these locking primitives.
> >
> > Use asm_inline instead of just asm. For inlining purposes, the size of
> > the asm is then taken as the minimum size, ignoring how many instructions
> > the compiler thinks it contains.
>
> So, the compiler is trying to decide whether to inline a function or
> not. The bigger it is, the less likely it is to be inlined. Since it is
> over-estimating the size of {,try_}cmpxchg{64,128}, it will avoid
> inlining it when it _should_ be inlining it.
>
> Is that it?

Yes, because the calculated length of what is effectively one
instruction gets unreasonably high. The compiler counts 20
*instructions*, each estimated to be 16 bytes long.
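In other words, the inliner is charged roughly 20 * 16 = 320 bytes for
what assembles to a single cmpxchg instruction (or a 5-byte call to the
emulation routine).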

> Is any of this measurable? Is there any objective data to support that
> this change is a good one?

Actually, "asm inline" was added to the GCC compiler just for this
purpose by request from the linux community [1]. My patch follows the
example of other similar macros (e.g. arch/x86/include/alternative.h)
and adds the same cure to asms that will undoubtedly result in a
single instruction [*].  The benefit is much more precise length
estimation, so the compiler heuristic is able to correctly estimate the
benefit of inlining, not being skewed by excessive use of
__always_inline directive. OTOH, it is hard to back up compiler
decisions by objective data, as inlining decisions depend on several
factors besides function size (e.g. how hot/cold is function), so a
simple comparison of kernel sizes does not present the full picture.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2018-December/512349.html

[*] Please note that if the asm template uses CC_SET, the compiler may
emit an additional SETx asm insn. However, all GCC versions that
support "asm inline" also support flag outputs, so they are guaranteed
to emit only one asm insn.
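
For reference, a standalone sketch (my example, not kernel code) of the
flag-output form mentioned above; with "=@ccz" the compiler reads ZF
directly, so no extra SETcc instruction appears in the template:

#include <stdbool.h>

static inline bool try_swap(unsigned long *p, unsigned long old,
			    unsigned long new)
{
	bool ok;

	asm volatile("lock cmpxchgq %3, %1"
		     : "=@ccz" (ok), "+m" (*p), "+a" (old)
		     : "r" (new)
		     : "memory");
	return ok;
}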

> It's quite possible that someone did the "asm" on purpose because
> over-estimating the size was a good thing.

I doubt this would be the case, and I would consider the code that
depends on this detail defective. The code that results in one asm
instruction should be accounted as such, no matter what internal
details are exposed in the instruction asm template.

Uros.



* Re: [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations
  2025-02-13 22:13     ` Uros Bizjak
@ 2025-02-13 22:52       ` Dave Hansen
  2025-02-14  7:25         ` Uros Bizjak
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Hansen @ 2025-02-13 22:52 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-mm, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Peter Zijlstra (Intel)

On 2/13/25 14:13, Uros Bizjak wrote:
> On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 2/13/25 11:14, Uros Bizjak wrote:
>>> According to [1], the usage of asm pseudo directives in the asm template
>>> can confuse the compiler to wrongly estimate the size of the generated
>>> code. ALTERNATIVE macro expands to several asm pseudo directives, so
>>> its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
>>> to fail by an order of magnitude (the compiler estimates the length of
>>> an asm to be more than 20 instructions).
>>
>> Just curious, but how did you come up with the "20 instructions" number?
> 
> Currently, a patched GCC compiler is needed (please see
> asm_insn_count() and asm_str_count() functions in gcc/final.cc on how
> the asm length is calculated) to report the length. For historic
> reasons, the length of asm is not printed in asm dumps, but recently a
> GCC PR was filed with a request to change this.

So, that's also good info to add. You can even do it in the changelog
with only a little more space than the existing changelog:

	... fail by an order of magnitude (a hacked-up gcc shows that it
	estimates the length of an asm to be more than 20 instructions).

...
>> Is any of this measurable? Is there any objective data to support that
>> this change is a good one?
> 
> Actually, "asm inline" was added to the GCC compiler just for this
> purpose by request from the linux community [1].

Wow, that's really important information. Shouldn't the fact
that this is leveraging a new feature that we asked for specifically get
called out somewhere?

Who asked for it? Are they on cc? Do they agree that this feature fills
the gap they wanted filled?

> My patch follows the
> example of other similar macros (e.g. arch/x86/include/alternative.h)
> and adds the same cure to asms that will undoubtedly result in a
> single instruction [*].  The benefit is much more precise length
> estimation, so the compiler heuristic is able to correctly estimate the
> benefit of inlining, not being skewed by excessive use of
> __always_inline directive. OTOH, it is hard to back up compiler
> decisions by objective data, as inlining decisions depend on several
> factors besides function size (e.g. how hot/cold is function), so a
> simple comparison of kernel sizes does not present the full picture.

Yes, the world is complicated. But, honestly, one data point is a
billion times better than zero. Right now, we're at zero.

>> It's quite possible that someone did the "asm" on purpose because
>> over-estimating the size was a good thing.
> 
> I doubt this would be the case, and I would consider the code that
> depends on this detail defective. The code that results in one asm
> instruction should be accounted as such, no matter what internal
> details are exposed in the instruction asm template.

Yeah, but defective or not, if this causes a regression, it's either not
getting applied or it gets reverted.

All that I'm asking here is that someone look at the kernel after the
patch gets applied and sanity check it. Absolutely basic scientific
method stuff. Make a hypothesis about what it will do:

	1. Inline these locking functions
	2. Make the kernel go faster for _something_

and if it doesn't match the hypothesis, then try and figure out why. You
don't have to do every config or every compiler. Just do one config and
one modern compiler.

Right now, this patch is saying:

	1. gcc appears to have done something that might be suboptimal
	2. gcc has a new feature that might make it less suboptimal
	3. here's a patch that should optimize things
	...

but then it leaves us hanging.  There's a lot of "mights" and "shoulds"
in there, but nothing that shows that this actually does anything
positive in practice.

Maybe I'm just a dummy and this is just an obvious improvement that I
can't grasp. If so, sorry for being so dense, but I'm going to need a
little more education before this gets applied.



* Re: [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
  2025-02-13 21:17   ` Uros Bizjak
@ 2025-02-13 22:54     ` Dave Hansen
  2025-02-14 18:22     ` Christoph Lameter (Ampere)
  1 sibling, 0 replies; 11+ messages in thread
From: Dave Hansen @ 2025-02-13 22:54 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: x86, linux-mm, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Peter Zijlstra (Intel)

On 2/13/25 13:17, Uros Bizjak wrote:
>> Basically, you've told me what the patch does, but not why anyone should
>> care or why it should be applied.
> This is actually explained at length in the comment for
> ASM_CALL_CONSTRAINT, which ALT_OUTPUT_SP macro uses.

Great info, thanks! Could you give the patch another shot and include
this in the changelog, please? Better yet, you could paraphrase the
comment so that we don't have to go searching for it.



* Re: [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations
  2025-02-13 22:52       ` Dave Hansen
@ 2025-02-14  7:25         ` Uros Bizjak
  0 siblings, 0 replies; 11+ messages in thread
From: Uros Bizjak @ 2025-02-14  7:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: x86, linux-mm, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Dennis Zhou,
	Tejun Heo, Christoph Lameter, Peter Zijlstra (Intel)

On Thu, Feb 13, 2025 at 11:52 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 2/13/25 14:13, Uros Bizjak wrote:
> > On Thu, Feb 13, 2025 at 9:48 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 2/13/25 11:14, Uros Bizjak wrote:
> >>> According to [1], the usage of asm pseudo directives in the asm template
> >>> can confuse the compiler to wrongly estimate the size of the generated
> >>> code. ALTERNATIVE macro expands to several asm pseudo directives, so
> >>> its usage in {,try_}cmpxchg{64,128} causes instruction length estimate
> >>> to fail by an order of magnitude (the compiler estimates the length of
> >>> an asm to be more than 20 instructions).
> >>
> >> Just curious, but how did you come up with the "20 instructions" number?
> >
> > Currently, a patched GCC compiler is needed (please see
> > asm_insn_count() and asm_str_count() functions in gcc/final.cc on how
> > the asm length is calculated) to report the length. For historic
> > reasons, the length of asm is not printed in asm dumps, but recently a
> > GCC PR was filed with a request to change this.
>
> So, that's also good info to add. You can even do it in the changelog
> with only a little more space than the existing changelog:
>
>         ... fail by an order of magnitude (a hacked-up gcc shows that it
>         estimates the length of an asm to be more than 20 instructions).
>
> ...
> >> Is any of this measurable? Is there any objective data to support that
> >> this change is a good one?
> >
> > Actually, "asm inline" was added to the GCC compiler just for this
> > purpose by request from the linux community [1].
>
> Wow, that's really important information. Shouldn't the fact
> that this is leveraging a new feature that we asked for specifically get
> called out somewhere?
>
> Who asked for it? Are they on cc? Do they agree that this feature fills
> the gap they wanted filled?

asm_inline is already used in some 40-50 places throughout the tree,
but there still remain some places that could benefit from it.

> > My patch follows the
> > example of other similar macros (e.g. arch/x86/include/alternative.h)
> > and adds the same cure to asms that will undoubtedly result in a
> > single instruction [*].  The benefit is much more precise length
> > estimation, so compiler heuristic is able to correctly estimate the
> > benefit of inlining, not being skewed by excessive use of
> > __always_inline directive. OTOH, it is hard to back up compiler
> > decisions by objective data, as inlining decisions depend on several
> > factors besides function size (e.g. how hot/cold is function), so a
> > simple comparison of kernel sizes does not present the full picture.
>
> Yes, the world is complicated. But, honestly, one data point is a
> billion times better than zero. Right now, we're at zero.
>
> >> It's quite possible that someone did the "asm" on purpose because
> >> over-estimating the size was a good thing.
> >
> > I doubt this would be the case, and I would consider the code that
> > depends on this detail defective. The code that results in one asm
> > instruction should be accounted as such, no matter what internal
> > details are exposed in the instruction asm template.
>
> Yeah, but defective or not, if this causes a regression, it's either not
> getting applied or it gets reverted.
>
> All that I'm asking here is that someone look at the kernel after the
> patch gets applied and sanity check it. Absolutely basic scientific
> method stuff. Make a hypothesis about what it will do:
>
>         1. Inline these locking functions
>         2. Make the kernel go faster for _something_
>
> and if it doesn't match the hypothesis, then try and figure out why. You
> don't have to do every config or every compiler. Just do one config and
> one modern compiler.
>
> Right now, this patch is saying:
>
>         1. gcc appears to have done something that might be suboptimal
>         2. gcc has a new feature that might make it less suboptimal
>         3. here's a patch that should optimize things
>         ...
>
> but then it leaves us hanging.  There's a lot of "mights" and "shoulds"
> in there, but nothing that shows that this actually does anything
> positive in practice.

Let me harvest some data and report the findings in a V2 ChangeLog.
However, these particular macros are rarely used, so I don't expect
any big changes in the generated asm code.

> Maybe I'm just a dummy and this is just an obvious improvement that I
> can't grasp. If so, sorry for being so dense, but I'm going to need a
> little more education before this gets applied.

Thanks,
Uros.



* Re: [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
  2025-02-13 21:17   ` Uros Bizjak
  2025-02-13 22:54     ` Dave Hansen
@ 2025-02-14 18:22     ` Christoph Lameter (Ampere)
  2025-02-14 19:55       ` Uros Bizjak
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Lameter (Ampere) @ 2025-02-14 18:22 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Dave Hansen, x86, linux-mm, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Dennis Zhou, Tejun Heo, Peter Zijlstra (Intel)

On Thu, 13 Feb 2025, Uros Bizjak wrote:

> OTOH, all recent x86_64 processors support CMPXCHG128 insn, so the
> call alternative will be rarely used.

Do we still support processors without cmpxchg128? If not, then let's just
drop the calls from the kernel.




* Re: [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
  2025-02-14 18:22     ` Christoph Lameter (Ampere)
@ 2025-02-14 19:55       ` Uros Bizjak
  0 siblings, 0 replies; 11+ messages in thread
From: Uros Bizjak @ 2025-02-14 19:55 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Dave Hansen, x86, linux-mm, linux-kernel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Dennis Zhou, Tejun Heo, Peter Zijlstra (Intel)

On Fri, Feb 14, 2025 at 7:22 PM Christoph Lameter (Ampere)
<cl@gentwo.org> wrote:
>
> On Thu, 13 Feb 2025, Uros Bizjak wrote:
>
> > OTOH, all recent x86_64 processors support CMPXCHG128 insn, so the
> > call alternative will be rarely used.
>
> Do we still support processors without cmpxchg128? If not, then let's just
> drop the calls from the kernel.

I'm not aware of any discussion about that decision.

Thanks,
Uros.



end of thread, other threads:[~2025-02-14 19:55 UTC | newest]

Thread overview: 11+ messages
2025-02-13 19:14 [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op() Uros Bizjak
2025-02-13 19:14 ` [PATCH RESEND 2/2] x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations Uros Bizjak
2025-02-13 20:48   ` Dave Hansen
2025-02-13 22:13     ` Uros Bizjak
2025-02-13 22:52       ` Dave Hansen
2025-02-14  7:25         ` Uros Bizjak
2025-02-13 20:43 ` [PATCH RESEND 1/2] x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op() Dave Hansen
2025-02-13 21:17   ` Uros Bizjak
2025-02-13 22:54     ` Dave Hansen
2025-02-14 18:22     ` Christoph Lameter (Ampere)
2025-02-14 19:55       ` Uros Bizjak
