* [PATCH v2 0/2] fix dying cpu compare race
@ 2023-04-06 1:56 Ye Bin
2023-04-06 1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
2023-04-06 1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
0 siblings, 2 replies; 8+ messages in thread
From: Ye Bin @ 2023-04-06 1:56 UTC (permalink / raw)
To: dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin
From: Ye Bin <yebin10@huawei.com>
This patch set solves a race between __percpu_counter_compare() and CPU offline.

Before commit 5825bea05265 ("xfs: __percpu_counter_compare() inode count debug
too expensive"), I hit the following issue when running a CPU online/offline
test:
smpboot: CPU 1 is now offline
XFS: Assertion failed: percpu_counter_compare(&mp->m_ifree, 0) >= 0, file: fs/xfs/xfs_trans.c, line: 622
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:110!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 3 PID: 25512 Comm: fsstress Not tainted 5.10.0-04288-gcb31bdc8c65d #8
RIP: 0010:assfail+0x77/0x8b fs/xfs/xfs_message.c:110
RSP: 0018:ffff88810a5df5c0 EFLAGS: 00010293
RAX: ffff88810f3a8000 RBX: 0000000000000201 RCX: ffffffffaa8bd7c0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 0000000000000000 R08: ffff88810f3a8000 R09: ffffed103edf71cd
R10: ffff8881f6fb8e67 R11: ffffed103edf71cc R12: ffffffffab0108c0
R13: ffffffffab010220 R14: ffffffffffffffff R15: 0000000000000000
FS: 00007f8536e16b80(0000) GS:ffff8881f6f80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005617e1115f44 CR3: 000000015873a005 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
xfs_trans_unreserve_and_mod_sb+0x833/0xca0 fs/xfs/xfs_trans.c:622
xlog_cil_commit+0x1169/0x29b0 fs/xfs/xfs_log_cil.c:1325
__xfs_trans_commit+0x2c0/0xe20 fs/xfs/xfs_trans.c:889
xfs_create_tmpfile+0x6a6/0x9a0 fs/xfs/xfs_inode.c:1320
xfs_rename_alloc_whiteout fs/xfs/xfs_inode.c:3193 [inline]
xfs_rename+0x58a/0x1e00 fs/xfs/xfs_inode.c:3245
xfs_vn_rename+0x28e/0x410 fs/xfs/xfs_iops.c:436
vfs_rename+0x10b5/0x1dd0 fs/namei.c:4329
do_renameat2+0xa19/0xb10 fs/namei.c:4474
__do_sys_renameat2 fs/namei.c:4512 [inline]
__se_sys_renameat2 fs/namei.c:4509 [inline]
__x64_sys_renameat2+0xe4/0x120 fs/namei.c:4509
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x61/0xc6
RIP: 0033:0x7f853623d91d
I can reproduce the above issue by injecting latency into the kernel to defeat
the quick judgment in __percpu_counter_compare().

In the quick-judgment logic, when racing with CPU offline, the number of online
CPUs may already have decreased before percpu_counter_cpu_dead() is called,
which leads to a miscalculation. For example:
Assumptions:
(1) batch = 32
(2) The true counter value is 2
(3) The number of CPUs is 4

Suppose the per-CPU deltas look like this when CPU3 goes offline:
    cpu0    cpu1    cpu2    cpu3
     31      31      31      31
fbc->count = -122  -> percpu_counter_cpu_dead() has not been called yet.

Now check whether the counter is greater than 0 (rhs = 0):
  abs(count - rhs) = 122
  batch * num_online_cpus() = 32 * 3 = 96  -> the online CPU count has already
  dropped to 3

That is, the abs(count - rhs) > batch * num_online_cpus() condition is met, so
the quick judgment is trusted. The true value is 2, but because count < 0 the
compare returns -1, the opposite of the correct result.
Ye Bin (2):
cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
lib/percpu_counter: fix dying cpu compare race
include/linux/cpumask.h | 20 ++++++++++++++++----
kernel/cpu.c | 2 ++
lib/percpu_counter.c | 11 ++++++++++-
3 files changed, 28 insertions(+), 5 deletions(-)
--
2.31.1
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
  2023-04-06  1:56 [PATCH v2 0/2] fix dying cpu compare race Ye Bin
@ 2023-04-06  1:56 ` Ye Bin
  2023-04-10 17:42   ` Yury Norov
  2023-04-10 20:12   ` Thomas Gleixner
  2023-04-06  1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
  1 sibling, 2 replies; 8+ messages in thread
From: Ye Bin @ 2023-04-06  1:56 UTC (permalink / raw)
To: dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin

From: Ye Bin <yebin10@huawei.com>

Introduce a '__num_dying_cpus' variable to cache the number of dying CPUs
in the core, and have num_dying_cpus() just return the cached value.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 include/linux/cpumask.h | 20 ++++++++++++++++----
 kernel/cpu.c            |  2 ++
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 2a61ddcf8321..8127fd598f51 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -135,6 +135,8 @@ extern struct cpumask __cpu_dying_mask;

 extern atomic_t __num_online_cpus;

+extern atomic_t __num_dying_cpus;
+
 extern cpumask_t cpus_booted_once_mask;

 static __always_inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
@@ -1018,10 +1020,14 @@ set_cpu_active(unsigned int cpu, bool active)

 static __always_inline void set_cpu_dying(unsigned int cpu, bool dying)
 {
-	if (dying)
-		cpumask_set_cpu(cpu, &__cpu_dying_mask);
-	else
-		cpumask_clear_cpu(cpu, &__cpu_dying_mask);
+	if (dying) {
+		if (!cpumask_test_and_set_cpu(cpu, &__cpu_dying_mask))
+			atomic_inc(&__num_dying_cpus);
+	}
+	else {
+		if (cpumask_test_and_clear_cpu(cpu, &__cpu_dying_mask))
+			atomic_dec(&__num_dying_cpus);
+	}
 }

 /**
@@ -1073,6 +1079,11 @@ static __always_inline unsigned int num_online_cpus(void)
 {
 	return arch_atomic_read(&__num_online_cpus);
 }
+
+static __always_inline unsigned int num_dying_cpus(void)
+{
+	return arch_atomic_read(&__num_dying_cpus);
+}
 #define num_possible_cpus()	cpumask_weight(cpu_possible_mask)
 #define num_present_cpus()	cpumask_weight(cpu_present_mask)
 #define num_active_cpus()	cpumask_weight(cpu_active_mask)
@@ -1108,6 +1119,7 @@ static __always_inline bool cpu_dying(unsigned int cpu)
 #define num_possible_cpus()	1U
 #define num_present_cpus()	1U
 #define num_active_cpus()	1U
+#define num_dying_cpus()	0U

 static __always_inline bool cpu_online(unsigned int cpu)
 {
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f4a2c5845bcb..1c96c04cb259 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2662,6 +2662,8 @@ EXPORT_SYMBOL(__cpu_dying_mask);
 atomic_t __num_online_cpus __read_mostly;
 EXPORT_SYMBOL(__num_online_cpus);

+atomic_t __num_dying_cpus __read_mostly;
+
 void init_cpu_present(const struct cpumask *src)
 {
 	cpumask_copy(&__cpu_present_mask, src);
--
2.31.1

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
  2023-04-06  1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
@ 2023-04-10 17:42   ` Yury Norov
  2023-04-10 20:12   ` Thomas Gleixner
  1 sibling, 0 replies; 8+ messages in thread
From: Yury Norov @ 2023-04-10 17:42 UTC (permalink / raw)
To: Ye Bin
Cc: dennis, tj, cl, linux-mm, andriy.shevchenko, linux, linux-kernel,
	dchinner, yebin10

On Thu, Apr 06, 2023 at 09:56:28AM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> Introduce a '__num_dying_cpus' variable to cache the number of dying CPUs
> in the core, and have num_dying_cpus() just return the cached value.
>
> Signed-off-by: Ye Bin <yebin10@huawei.com>

It looks like you didn't address any comments for v1. Can you please do
that? Otherwise, NAK.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count
  2023-04-06  1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
  2023-04-10 17:42   ` Yury Norov
@ 2023-04-10 20:12   ` Thomas Gleixner
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2023-04-10 20:12 UTC (permalink / raw)
To: Ye Bin, dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin

On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> Introduce a '__num_dying_cpus' variable to cache the number of dying CPUs
> in the core, and have num_dying_cpus() just return the cached value.

Why? That atomic counter is racy too if read and acted upon w/o having
CPUs read locked. All it does is making the race window smaller vs. the
cpumask_weight() based implementation. It's still racy and incorrect.

So no, this is not going to happen.

Thanks,
        tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
  2023-04-06  1:56 [PATCH v2 0/2] fix dying cpu compare race Ye Bin
  2023-04-06  1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
@ 2023-04-06  1:56 ` Ye Bin
  2023-04-08  7:13   ` Dennis Zhou
  2023-04-10 20:53   ` Thomas Gleixner
  1 sibling, 2 replies; 8+ messages in thread
From: Ye Bin @ 2023-04-06  1:56 UTC (permalink / raw)
To: dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin

From: Ye Bin <yebin10@huawei.com>

In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
condition between a cpu dying and percpu_counter_sum() iterating online
CPUs was identified.

Actually, the same race condition exists between a cpu dying and
__percpu_counter_compare(), which uses num_online_cpus() for its quick
judgment. But num_online_cpus() is decreased before
percpu_counter_cpu_dead() is called, so the quick judgment may return an
incorrect result.

To solve the above issue, also count dying CPUs in the quick judgment in
__percpu_counter_compare().

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 lib/percpu_counter.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 5004463c4f9f..399840cb0012 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
 	return 0;
 }

+static __always_inline unsigned int num_count_cpus(void)
+{
+#ifdef CONFIG_HOTPLUG_CPU
+	return (num_online_cpus() + num_dying_cpus());
+#else
+	return num_online_cpus();
+#endif
+}
+
 /*
  * Compare counter against given value.
  * Return 1 if greater, 0 if equal and -1 if less
@@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)

 	count = percpu_counter_read(fbc);
 	/* Check to see if rough count will be sufficient for comparison */
-	if (abs(count - rhs) > (batch * num_online_cpus())) {
+	if (abs(count - rhs) > (batch * num_count_cpus())) {
 		if (count > rhs)
 			return 1;
 		else
--
2.31.1

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
  2023-04-06  1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
@ 2023-04-08  7:13   ` Dennis Zhou
  2023-04-10 20:53   ` Thomas Gleixner
  1 sibling, 0 replies; 8+ messages in thread
From: Dennis Zhou @ 2023-04-08  7:13 UTC (permalink / raw)
To: Ye Bin
Cc: tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux,
	linux-kernel, yebin10, dchinner

Hello,

On Thu, Apr 06, 2023 at 09:56:29AM +0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> condition between a cpu dying and percpu_counter_sum() iterating online
> CPUs was identified.
>
> Actually, the same race condition exists between a cpu dying and
> __percpu_counter_compare(), which uses num_online_cpus() for its quick
> judgment. But num_online_cpus() is decreased before
> percpu_counter_cpu_dead() is called, so the quick judgment may return an
> incorrect result.
>
> To solve the above issue, also count dying CPUs in the quick judgment in
> __percpu_counter_compare().
>

I've thought a lot about this since you sent v1. For the general
problem, you haven't addressed Dave's concerns from [1]. I agree you've
found a valid race condition, but as Dave mentioned, there's no
synchronization in __percpu_counter_compare() and consequently no
guarantees about the accuracy of the value.

However, I might be missing something, but I do think the use case in
5825bea05265 ("xfs: __percpu_counter_compare() inode count debug too
expensive") is potentially valid. If the rhs is an expected lower bound
or upper bound (depending on whether you're counting up or down, but not
both) and the count you're accounting has the same expectations as
percpu_refcount (you can only subtract what you've already added), then
should percpu_counter_sum() ever be on the wrong side of rhs, that
should be an error and visible via percpu_counter_compare().

I need to think a little longer, but my initial thought is that while
you close a race condition, the function itself is inherently
vulnerable.

[1] https://lore.kernel.org/lkml/ZCu9LtdA+NMrfG9x@rh/

Thanks,
Dennis

> Signed-off-by: Ye Bin <yebin10@huawei.com>
> ---
>  lib/percpu_counter.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
> index 5004463c4f9f..399840cb0012 100644
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -227,6 +227,15 @@ static int percpu_counter_cpu_dead(unsigned int cpu)
>  	return 0;
>  }
>
> +static __always_inline unsigned int num_count_cpus(void)
> +{
> +#ifdef CONFIG_HOTPLUG_CPU
> +	return (num_online_cpus() + num_dying_cpus());
> +#else
> +	return num_online_cpus();
> +#endif
> +}
> +
>  /*
>   * Compare counter against given value.
>   * Return 1 if greater, 0 if equal and -1 if less
> @@ -237,7 +246,7 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
>
>  	count = percpu_counter_read(fbc);
>  	/* Check to see if rough count will be sufficient for comparison */
> -	if (abs(count - rhs) > (batch * num_online_cpus())) {
> +	if (abs(count - rhs) > (batch * num_count_cpus())) {
>  		if (count > rhs)
>  			return 1;
>  		else
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
  2023-04-06  1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
  2023-04-08  7:13   ` Dennis Zhou
@ 2023-04-10 20:53   ` Thomas Gleixner
  2023-04-11  6:56     ` Thomas Gleixner
  1 sibling, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2023-04-10 20:53 UTC (permalink / raw)
To: Ye Bin, dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin, Peter Zijlstra,
	Valentin Schneider, Linus Torvalds

On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
>
> In commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") a race
> condition between a cpu dying and percpu_counter_sum() iterating online
> CPUs was identified.
>
> Actually, the same race condition exists between a cpu dying and
> __percpu_counter_compare(), which uses num_online_cpus() for its quick
> judgment. But num_online_cpus() is decreased before
> percpu_counter_cpu_dead() is called, so the quick judgment may return an
> incorrect result.
>
> To solve the above issue, also count dying CPUs in the quick judgment in
> __percpu_counter_compare().

This is all bogus including the above commit.

All CPU masks including cpu_online_mask and cpu_dying_mask are only
valid when the context is:

 - a CPU hotplug callback, or

 - any other code which holds cpu_hotplug_lock read or write locked.

cpu_online_mask is special in that regard. It is also protected against
modification in any preemption disabled region, but that's a pure
implementation detail.

cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
held. And even with that lock held, any cpumask operation on it is silly.
The mask is a core detail:

  commit e40f74c535b8 "cpumask: Introduce DYING mask"

    Introduce a cpumask that indicates (for each CPU) what direction the
    CPU hotplug is currently going. Notably, it tracks rollbacks. Eg.
    when an up fails and we do a roll-back down, it will accurately
    reflect the direction.

It does not tell anything to a user which is not aware of the actual
hotplug state machine state.

The real reason for this percpu counter issue is how the percpu counter
hotplug callbacks are implemented: they are asymmetric and in the
completely wrong place.

The above 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") was
done via XFS and without the courtesy of CC'ing the people who care
about the CPU hotplug core. The lengthy analysis in that commit is all
shiny, but fundamentally wrong. See above.

I'm way too tired to come up with a proper fix for this mess, but I'm
going to stare at it tomorrow morning with brain awake.

Thanks,
        tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race
  2023-04-10 20:53   ` Thomas Gleixner
@ 2023-04-11  6:56     ` Thomas Gleixner
  0 siblings, 0 replies; 8+ messages in thread
From: Thomas Gleixner @ 2023-04-11  6:56 UTC (permalink / raw)
To: Ye Bin, dennis, tj, cl, linux-mm, yury.norov, andriy.shevchenko, linux
Cc: linux-kernel, dchinner, yebin10, yebin, Peter Zijlstra,
	Valentin Schneider, Linus Torvalds

On Mon, Apr 10 2023 at 22:53, Thomas Gleixner wrote:
> On Thu, Apr 06 2023 at 09:56, Ye Bin wrote:
> cpu_dying_mask is a random number generator w/o cpu_hotplug_lock being
> held. And even with that lock held, any cpumask operation on it is silly.
> The mask is a core detail:
>
>   commit e40f74c535b8 "cpumask: Introduce DYING mask"
>
>     Introduce a cpumask that indicates (for each CPU) what direction the
>     CPU hotplug is currently going. Notably, it tracks rollbacks. Eg.
>     when an up fails and we do a roll-back down, it will accurately
>     reflect the direction.
>
> It does not tell anything to a user which is not aware of the actual
> hotplug state machine state.

Even if the mask is stable most of the time, it's a total disaster
performance wise. The bits in cpu_dying_mask are sticky until the next
online operation. So for a system which has SMT enabled in the BIOS, but
SMT disabled on the kernel command line or later via the sysfs knob,
this means that the loop in __percpu_counter_sum() will iterate over all
shut-down SMT siblings forever.

IOW, it will touch nr_of_shutdown_cpus() cachelines for absolutely zero
reason. The same applies to the proposed counter.

Thanks,
        tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2023-04-11  6:56 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-06  1:56 [PATCH v2 0/2] fix dying cpu compare race Ye Bin
2023-04-06  1:56 ` [PATCH v2 1/2] cpu/hotplug: introduce 'num_dying_cpus' to get dying CPUs count Ye Bin
2023-04-10 17:42   ` Yury Norov
2023-04-10 20:12   ` Thomas Gleixner
2023-04-06  1:56 ` [PATCH v2 2/2] lib/percpu_counter: fix dying cpu compare race Ye Bin
2023-04-08  7:13   ` Dennis Zhou
2023-04-10 20:53   ` Thomas Gleixner
2023-04-11  6:56     ` Thomas Gleixner