* [PATCH v2 0/2] fix dying cpu compare race
@ 2023-04-06 1:56 Ye Bin
2023-04-06 1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
2023-04-06 1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
0 siblings, 2 replies; 8+ messages in thread
From: Ye Bin @ 2023-04-06 1:56 UTC (permalink / raw)
To: dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin
From: Ye Bin <yebin10@huawei.com>
This patch set solves a race between __percpu_counter_compare() and CPU offline.

Before commit 5825bea05265 ("xfs: __percpu_counter_compare() inode count debug
too expensive"), I hit the following issue when running a CPU online/offline
test:
smpboot: CPU 1 is now offline
XFS: Assertion failed: percpu_counter_compare(&mp->m_ifree, 0) >= 0, file: fs/xfs/xfs_trans.c, line: 622
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:110!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 25512 Comm: fsstress Not tainted 5.10.0-04288-gcb31bdc8c65d #8
RIP: 0010:assfail+0x77/0x8b fs/xfs/xfs_message.c:110
RSP: 0018:ffff88810a5df5c0 EFLAGS: 00010293
RAX: ffff88810f3a8000 RBX: 0000000000000201 RCX: ffffffffaa8bd7c0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000000 R08: ffff88810f3a8000 R09: ffffed103edf71cd
R10: ffff8881f6fb8e67 R11: ffffed103edf71cc R12: ffffffffab0108c0
R13: ffffffffab010220 R14: ffffffffffffffff R15: 0000000000000000
FS: 00007f8536e16b80(0000) GS:ffff8881f6f80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005617e1115f44 CR3: 000000015873a005 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
xfs_trans_unreserve_and_mod_sb+0x833/0xca0 fs/xfs/xfs_trans.c:622
xlog_cil_commit+0x1169/0x29b0 fs/xfs/xfs_log_cil.c:1325
__xfs_trans_commit+0x2c0/0xe20 fs/xfs/xfs_trans.c:889
xfs_create_tmpfile+0x6a6/0x9a0 fs/xfs/xfs_inode.c:1320
xfs_rename_alloc_whiteout fs/xfs/xfs_inode.c:3193 [inline]
xfs_rename+0x58a/0x1e00 fs/xfs/xfs_inode.c:3245
xfs_vn_rename+0x28e/0x410 fs/xfs/xfs_iops.c:436
vfs_rename+0x10b5/0x1dd0 fs/namei.c:4329
do_renameat2+0xa19/0xb10 fs/namei.c:4474
__do_sys_renameat2 fs/namei.c:4512 [inline]
__se_sys_renameat2 fs/namei.c:4509 [inline]
__x64_sys_renameat2+0xe4/0x120 fs/namei.c:4509
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x61/0xc6
RIP: 0033:0x7f853623d91d
I can reproduce the above issue by injecting latency into the kernel to defeat
the quick judgment in __percpu_counter_compare().

In the quick-judgment logic, when racing with CPU offline, the number of online
CPUs may already have decreased before percpu_counter_cpu_dead() is called,
which leads to a miscalculation. For example:
Assumptions:
(1) batch = 32
(2) The true counter value is 2
(3) The number of CPUs is 4

Suppose the per-CPU deltas look like this when CPU3 goes offline:
    cpu0    cpu1    cpu2    cpu3
     31      31      31      31
fbc->count = -122  -> percpu_counter_cpu_dead() has not been called yet.

Now check whether the counter is greater than 0 (rhs = 0):
  abs(count - rhs) = 122
  batch * num_online_cpus() = 32 * 3 = 96  -> the online CPU count has already
  dropped to 3

That is, the abs(count - rhs) > batch * num_online_cpus() condition is met, so
the quick judgment is trusted. The true value is 2, but because count < 0 the
compare returns -1, the opposite of the correct result.
Ye Bin (2):
cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
lib/percpu_counter: fix dying cpu compare race
include/linux/cpumask.h | 20 ++++++++++++++++----
kernel/cpu.c | 2 ++
lib/percpu_counter.c | 11 ++++++++++-
3 files changed, 28 insertions(+), 5 deletions(-)
--
2.31.1
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
  2023-04-06  1:56 [PATCH v2 0/2] fix dying cpu compare race Ye Bin
@ 2023-04-06  1:56 ` Ye Bin
  2023-04-10 17:42   ` Yury Norov
  2023-04-10 20:12   ` Thomas Gleixner
  2023-04-06  1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
  1 sibling, 2 replies; 8+ messages in thread
From: Ye Bin @ 2023-04-06  1:56 UTC (permalink / raw)
To: dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin

From: Ye Bin <yebin10@huawei.com>

Introduce a '__num_dying_cpus' variable to cache the number of dying CPUs
in the core, and have num_dying_cpus() just return the cached value.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 include/linux/cpumask.h | 20 ++++++++++++++++----
 kernel/cpu.c            |  2 ++
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 2a61ddcf8321..8127fd598f51 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -135,6 +135,8 @@ extern struct cpumask __cpu_dying_mask;

 extern atomic_t __num_online_cpus;

+extern atomic_t __num_dying_cpus;
+
 extern cpumask_t cpus_booted_once_mask;

 static __always_inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
@@ -1018,10 +1020,14 @@ set_cpu_active(unsigned int cpu, bool active)

 static __always_inline void set_cpu_dying(unsigned int cpu, bool dying)
 {
-	if (dying)
-		cpumask_set_cpu(cpu, &__cpu_dying_mask);
-	else
-		cpumask_clear_cpu(cpu, &__cpu_dying_mask);
+	if (dying) {
+		if (!cpumask_test_and_set_cpu(cpu, &__cpu_dying_mask))
+			atomic_inc(&__num_dying_cpus);
+	}
+	else {
+		if (cpumask_test_and_clear_cpu(cpu, &__cpu_dying_mask))
+			atomic_dec(&__num_dying_cpus);
+	}
 }

 /**
@@ -1073,6 +1079,11 @@ static __always_inline unsigned int num_online_cpus(void)
 {
 	return arch_atomic_read(&__num_online_cpus);
 }
+
+static __always_inline unsigned int num_dying_cpus(void)
+{
+	return arch_atomic_read(&__num_dying_cpus);
+}
 #define num_possible_cpus()	cpumask_weight(cpu_possible_mask)
 #define num_present_cpus()	cpumask_weight(cpu_present_mask)
 #define num_active_cpus()	cpumask_weight(cpu_active_mask)
@@ -1108,6 +1119,7 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 #define num_possible_cpus()	1U
 #define num_present_cpus()	1U
 #define num_active_cpus()	1U
+#define num_dying_cpus()	0U

 static __always_inline bool cpu_online(unsigned int cpu)
 {
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f4a2c5845bcb..1c96c04cb259 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2662,6 +2662,8 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);

+atomic_t __num_dying_cpus __read_mostly;
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
--
2.31.1

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
  2023-04-06  1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
@ 2023-04-10 17:42   ` Yury Norov
  2023-04-10 20:12   ` Thomas Gleixner
  1 sibling, 0 replies; 8+ messages in thread
From: Yury Norov @ 2023-04-10 17:42 UTC (permalink / raw)
To: Ye Bin
Cc: dennis, tj, cl, linux-mm, andriy.shevchenko, linux, linux-kernel,
	dchinner, yebin10

On Thu, Apr 06, 2023 at 09:56:28AM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> Introduce a '__num_dying_cpus' variable to cache the number of dying CPUs
> in the core, and have num_dying_cpus() just return the cached value.
>
> Signed-off-by: Ye Bin <yebin10@huawei.com>

It looks like you didn't address any comments for v1. Can you please do
that? Otherwise, NAK.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
  2023-04-06  1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
  2023-04-10 17:42   ` Yury Norov
@ 2023-04-10 20:12   ` Thomas Gleixner
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2023-04-10 20:12 UTC (permalink / raw)
To: Ye Bin, dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin

On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> Introduce a '__num_dying_cpus' variable to cache the number of dying CPUs
> in the core, and have num_dying_cpus() just return the cached value.

Why? That atomic counter is racy too if read and acted upon w/o having
CPUs read locked. All it does is making the race window smaller vs. the
cpumask_weight() based implementation. It's still racy and incorrect.

So no, this is not going to happen.

Thanks,
        tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
  2023-04-06  1:56 [PATCH v2 0/2] fix dying cpu compare race Ye Bin
  2023-04-06  1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
@ 2023-04-06  1:56 ` Ye Bin
  2023-04-08  7:13   ` Dennis Zhou
  2023-04-10 20:53   ` Thomas Gleixner
  1 sibling, 2 replies; 8+ messages in thread
From: Ye Bin @ 2023-04-06  1:56 UTC (permalink / raw)
To: dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin

From: Ye Bin <yebin10@huawei.com>

In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
condition between a cpu dying and percpu_counter_sum() iterating online
CPUs was identified.

Actually, the same race condition exists between a cpu dying and
__percpu_counter_compare(), which uses num_online_cpus() for its quick
judgment. But num_online_cpus() is decreased before
percpu_counter_cpu_dead() is called, so the quick judgment may return an
incorrect result.

To solve the above issue, also count dying CPUs in the quick judgment in
__percpu_counter_compare().

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 lib/percpu_counter.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 5004463c4f9f..399840cb0012 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
 	return 0;
 }

+static __always_inline unsigned int num_count_cpus(void)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	return (num_online_cpus() + num_dying_cpus());
+#else
+	return num_online_cpus();
+#endif
+}
+
 /*
  * Compare counter against given value.
  * Return 1 if greater, 0 if equal and -1 if less
@@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)

 	count = percpu_counter_read(fbc);
 	/* Check to see if rough count will be sufficient for comparison */
-	if (abs(count - rhs) > (batch * num_online_cpus())) {
+	if (abs(count - rhs) > (batch * num_count_cpus())) {
 		if (count > rhs)
 			return 1;
 		else
--
2.31.1

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
  2023-04-06  1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
@ 2023-04-08  7:13   ` Dennis Zhou
  2023-04-10 20:53   ` Thomas Gleixner
  1 sibling, 0 replies; 8+ messages in thread
From: Dennis Zhou @ 2023-04-08  7:13 UTC (permalink / raw)
To: Ye Bin
Cc: tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux,
	linux-kernel, yebin10, dchinner

Hello,

On Thu, Apr 06, 2023 at 09:56:29AM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> condition between a cpu dying and percpu_counter_sum() iterating online
> CPUs was identified.
>
> Actually, the same race condition exists between a cpu dying and
> __percpu_counter_compare(), which uses num_online_cpus() for its quick
> judgment. But num_online_cpus() is decreased before
> percpu_counter_cpu_dead() is called, so the quick judgment may return an
> incorrect result.
>
> To solve the above issue, also count dying CPUs in the quick judgment in
> __percpu_counter_compare().
>

I've thought a lot about this since you sent v1. For the general
problem, you haven't addressed Dave's concerns from [1]. I agree you've
found a valid race condition, but as Dave mentioned, there's no
synchronization in __percpu_counter_compare() and consequently no
guarantees about the accuracy of the value.

However, I might be missing something, but I do think the use case in
5825bea05265 ("xfs: __percpu_counter_compare() inode count debug too
expensive") is potentially valid. If the rhs is an expected lower bound
or upper bound (depending on whether you're counting up or down, but not
both) and the count you're accounting has the same expectations as
percpu_refcount (you can only subtract what you've already added), then
should percpu_counter_sum() ever be on the wrong side of rhs, that
should be an error and visible via percpu_counter_compare().

I need to think a little longer, but my initial thought is that while
you close a race condition, the function itself is inherently
vulnerable.

[1] https://lore.kernel.org/lkml/ZCu9LtdA+NMrfG9x@rh/

Thanks,
Dennis

> Signed-off-by: Ye Bin <yebin10@huawei.com>
> ---
>  lib/percpu_counter.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 5004463c4f9f..399840cb0012 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
>  	return 0;
>  }
>
> +static __always_inline unsigned int num_count_cpus(void)
> +{
> +#ifdef CONFIG_HOTPLUG_CPU
> +	return (num_online_cpus() + num_dying_cpus());
> +#else
> +	return num_online_cpus();
> +#endif
> +}
> +
>  /*
>   * Compare counter against given value.
>   * Return 1 if greater, 0 if equal and -1 if less
> @@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
>
>  	count = percpu_counter_read(fbc);
>  	/* Check to see if rough count will be sufficient for comparison */
> -	if (abs(count - rhs) > (batch * num_online_cpus())) {
> +	if (abs(count - rhs) > (batch * num_count_cpus())) {
>  		if (count > rhs)
>  			return 1;
>  		else
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
  2023-04-06  1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
  2023-04-08  7:13   ` Dennis Zhou
@ 2023-04-10 20:53   ` Thomas Gleixner
  2023-04-11  6:56     ` Thomas Gleixner
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2023-04-10 20:53 UTC (permalink / raw)
To: Ye Bin, dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin, Peter Zijlstra,
	Valentin Schneider, Linus Torvalds

On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> condition between a cpu dying and percpu_counter_sum() iterating online
> CPUs was identified.
>
> Actually, the same race condition exists between a cpu dying and
> __percpu_counter_compare(), which uses num_online_cpus() for its quick
> judgment. But num_online_cpus() is decreased before
> percpu_counter_cpu_dead() is called, so the quick judgment may return an
> incorrect result.
>
> To solve the above issue, also count dying CPUs in the quick judgment in
> __percpu_counter_compare().

This is all bogus including the above commit.

All CPU masks including cpu_online_mask and cpu_dying_mask are only
valid when the context is:

 - a CPU hotplug callback, or

 - any other code which holds cpu_hotplug_lock read or write locked.

cpu_online_mask is special in that regard. It is also protected against
modification in any preemption disabled region, but that's a pure
implementation detail.

cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
held. And even with that lock held, any cpumask operation on it is silly.
The mask is a core detail:

  commit e40f74c535b8 "cpumask: Introduce DYING mask"

    Introduce a cpumask that indicates (for each CPU) what direction the
    CPU hotplug is currently going. Notably, it tracks rollbacks. Eg.
    when an up fails and we do a roll-back down, it will accurately
    reflect the direction.

It does not tell anything to a user which is not aware of the actual
hotplug state machine state.

The real reason for this percpu counter issue is how the percpu counter
hotplug callbacks are implemented: they are asymmetric and in the
completely wrong place.

The above 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") was
done via XFS and without the courtesy of CC'ing the people who care
about the CPU hotplug core. The lengthy analysis in that commit is all
shiny, but fundamentally wrong. See above.

I'm way too tired to come up with a proper fix for this mess, but I'm
going to stare at it tomorrow morning with brain awake.

Thanks,
        tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
  2023-04-10 20:53   ` Thomas Gleixner
@ 2023-04-11  6:56     ` Thomas Gleixner
  0 siblings, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2023-04-11  6:56 UTC (permalink / raw)
To: Ye Bin, dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin, Peter Zijlstra,
	Valentin Schneider, Linus Torvalds

On Mon, Apr 10 2023 at 22:53, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
> held. And even with that lock held, any cpumask operation on it is silly.
> The mask is a core detail:
>
>   commit e40f74c535b8 "cpumask: Introduce DYING mask"
>
>     Introduce a cpumask that indicates (for each CPU) what direction the
>     CPU hotplug is currently going. Notably, it tracks rollbacks. Eg.
>     when an up fails and we do a roll-back down, it will accurately
>     reflect the direction.
>
> It does not tell anything to a user which is not aware of the actual
> hotplug state machine state.

Even if the mask is stable most of the time, it's a total disaster
performance wise. The bits in cpu_dying_mask are sticky until the next
online operation. So for a system which has SMT enabled in the BIOS, but
SMT disabled on the kernel command line or later via the sysfs knob,
this means that the loop in __percpu_counter_sum() will iterate over all
shut-down SMT siblings forever.

IOW, it will touch nr_of_shutdown_cpus() cachelines for absolutely zero
reason. The same applies to the proposed counter.

Thanks,
        tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2023-04-11  6:56 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-06  1:56 [PATCH v2 0/2] fix dying cpu compare race Ye Bin
2023-04-06  1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
2023-04-10 17:42   ` Yury Norov
2023-04-10 20:12   ` Thomas Gleixner
2023-04-06  1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
2023-04-08  7:13   ` Dennis Zhou
2023-04-10 20:53   ` Thomas Gleixner
2023-04-11  6:56     ` Thomas Gleixner