From: Dennis Zhou <dennis@kernel.org>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Valentin Schneider <vschneid@redhat.com>,
Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@linux.com>,
Dave Chinner <dchinner@redhat.com>,
Yury Norov <yury.norov@gmail.com>,
Andy Shevchenko <andriy.shevchenko@linux.intel.com>,
Rasmus Villemoes <linux@rasmusvillemoes.dk>,
Ye Bin <yebin10@huawei.com>,
linux-mm@kvack.org
Subject: Re: [patch 1/3] lib/percpu_counter: Fix CPU hotplug handling
Date: Fri, 14 Apr 2023 22:20:04 -0700
Message-ID: <ZDo0BHfm+iEGQyEX@snowbird>
In-Reply-To: <20230414162841.166896739@linutronix.de>

Hello,

On Fri, Apr 14, 2023 at 06:30:43PM +0200, Thomas Gleixner wrote:
> Commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") tried to
> address a race condition between percpu_counter_sum() and a concurrent CPU
> hotplug operation.
>
> The race window is between the point where an unplugged CPU removes itself
> from cpu_online_mask and the hotplug state callback which folds the per
> CPU counters of the now dead CPU into the global count.
>
> percpu_counter_sum() used for_each_online_cpu() to accumulate the per CPU
> local counts, so during the race window it failed to account for the not
> yet folded back local count of the offlined CPU.
>
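For illustration, here is a minimal sketch of the racy summation described
above, assuming the pre-8b57b11cca88 shape of the loop (simplified and with
a made-up name, not the verbatim kernel code):

	/*
	 * Holding fbc->lock does not help against hotplug here: a CPU
	 * which has already cleared its bit in cpu_online_mask but has
	 * not yet run the DEAD callback is skipped by the iteration,
	 * so its not yet folded local count is missing from the sum.
	 */
	s64 racy_sum_sketch(struct percpu_counter *fbc)
	{
		unsigned long flags;
		s64 ret;
		int cpu;

		raw_spin_lock_irqsave(&fbc->lock, flags);
		ret = fbc->count;
		for_each_online_cpu(cpu)	/* <-- race window */
			ret += *per_cpu_ptr(fbc->counters, cpu);
		raw_spin_unlock_irqrestore(&fbc->lock, flags);
		return ret;
	}
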
> The attempt to address this used the admittedly undocumented and
> pointlessly public cpu_dying_mask by changing the loop iterator to take
> both the cpu_online_mask and the cpu_dying_mask into account.
>
> That works to some extent, but it is incorrect.
>
> The cpu_dying_mask bits are sticky even after cpu_up()/cpu_down()
> completes. That means that all offlined CPUs are always taken into
> account. In the case of disabling SMT at boot time or runtime this results
> in evaluating _all_ offlined SMT siblings' counters forever. Depending on
> system size, that's a massive amount of cache lines to be touched forever.
>
> It might be argued that the cpu_dying_mask bit could be cleared when
> cpu_down() completes, but that's not possible under all circumstances.
>
> Especially with partial hotplug the bit must be sticky in order to keep the
> initial user, i.e. the scheduler, correct. Partial hotplug, which allows
> explicit state transitions, can also create a situation where the race
> window gets recreated:
>
> cpu_down(target = CPUHP_PERCPU_CNT_DEAD + 1)
>
> brings a CPU down to one state before the per CPU counter folding
> callback. As this did not reach the CPUHP_OFFLINE state, the bit would stay set.
> Now the next partial operation:
>
> cpu_up(target = CPUHP_PERCPU_CNT_DEAD + 2)
>
> has to clear the bit and the race window is open again.
>
> There are two ways to solve this:
>
> 1) Maintain a local CPU mask in the per CPU counter code which
> gets the bit set when a CPU comes online and cleared in the
> CPUHP_PERCPU_CNT_DEAD state after folding.
>
> This adds more code and complexity.
>
> 2) Move the folding hotplug state into the DYING callback section, which
> runs on the outgoing CPU immediately after it has cleared its online bit.
>
> There is no concurrency vs. percpu_counter_sum() on another CPU
> because all still online CPUs are waiting in stop_machine() for the
> outgoing CPU to complete its shutdown. The raw spinlock held around
> the CPU mask iteration prevents an online CPU from reaching the stop
> machine thread while iterating, which implicitly prevents the
> outgoing CPU from clearing its online bit.
>
> This is way simpler than #1 and makes the hotplug calls symmetric at
> the price of a slightly longer wait time in stop_machine(), which is
> not the end of the world as CPU unplug is already slow. The overall
> time for a cpu_down() operation stays exactly the same.
>
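For comparison, a rough sketch of what the rejected alternative #1 could
have looked like. All names below are hypothetical and this is not part of
the patch:

	/* Private mask maintained by the percpu_counter code itself. */
	static struct cpumask percpu_counter_cpus;

	static int percpu_counter_cpu_online(unsigned int cpu)
	{
		cpumask_set_cpu(cpu, &percpu_counter_cpus);
		return 0;
	}

	static int percpu_counter_cpu_dead(unsigned int cpu)
	{
		/*
		 * Fold this CPU's local counts into fbc->count first,
		 * then drop the CPU from the private mask.
		 */
		cpumask_clear_cpu(cpu, &percpu_counter_cpus);
		return 0;
	}

__percpu_counter_sum() would then iterate percpu_counter_cpus instead of
cpu_online_mask, so a dying CPU stays visible until its counts have actually
been folded back, at the cost of the extra mask and a second pair of
callbacks.
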
> Implement #2 and plug the race completely.
>
> percpu_counter_sum() is still inherently racy against a concurrent
> percpu_counter_add_batch() fastpath unless externally serialized. That's
> completely independent of CPU hotplug though.
>
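To make that remaining, hotplug independent race concrete: the add side
updates its per CPU counter without taking fbc->lock as long as the local
delta stays within the batch. A simplified sketch, assuming the usual shape
of percpu_counter_add_batch() rather than quoting it verbatim:

	void add_batch_sketch(struct percpu_counter *fbc, s64 amount,
			      s32 batch)
	{
		s64 count;

		preempt_disable();
		count = __this_cpu_read(*fbc->counters) + amount;
		if (abs(count) >= batch) {
			unsigned long flags;

			/*
			 * Slow path: fold the local delta into the
			 * global count under fbc->lock.
			 */
			raw_spin_lock_irqsave(&fbc->lock, flags);
			fbc->count += count;
			__this_cpu_sub(*fbc->counters, count - amount);
			raw_spin_unlock_irqrestore(&fbc->lock, flags);
		} else {
			/*
			 * Fast path: lockless per CPU update, which is
			 * what __percpu_counter_sum() can race with.
			 */
			this_cpu_add(*fbc->counters, amount);
		}
		preempt_enable();
	}
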
> Fixes: 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race")
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dennis Zhou <dennis@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Yury Norov <yury.norov@gmail.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Cc: Ye Bin <yebin10@huawei.com>
> Cc: linux-mm@kvack.org
> ---
> include/linux/cpuhotplug.h | 2 -
> lib/percpu_counter.c | 57 +++++++++++++++++++--------------------------
> 2 files changed, 26 insertions(+), 33 deletions(-)
>
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -91,7 +91,6 @@ enum cpuhp_state {
> CPUHP_PRINTK_DEAD,
> CPUHP_MM_MEMCQ_DEAD,
> CPUHP_XFS_DEAD,
> - CPUHP_PERCPU_CNT_DEAD,
> CPUHP_RADIX_DEAD,
> CPUHP_PAGE_ALLOC,
> CPUHP_NET_DEV_DEAD,
> @@ -196,6 +195,7 @@ enum cpuhp_state {
> CPUHP_AP_SMPCFD_DYING,
> CPUHP_AP_X86_TBOOT_DYING,
> CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
> + CPUHP_AP_PERCPU_COUNTER_STARTING,
> CPUHP_AP_ONLINE,
> CPUHP_TEARDOWN_CPU,
>
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -12,7 +12,7 @@
>
> #ifdef CONFIG_HOTPLUG_CPU
> static LIST_HEAD(percpu_counters);
> -static DEFINE_SPINLOCK(percpu_counters_lock);
> +static DEFINE_RAW_SPINLOCK(percpu_counters_lock);
> #endif
>
> #ifdef CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER
> @@ -126,13 +126,8 @@ EXPORT_SYMBOL(percpu_counter_sync);
> * Add up all the per-cpu counts, return the result. This is a more accurate
> * but much slower version of percpu_counter_read_positive().
> *
> - * We use the cpu mask of (cpu_online_mask | cpu_dying_mask) to capture sums
> - * from CPUs that are in the process of being taken offline. Dying cpus have
> - * been removed from the online mask, but may not have had the hotplug dead
> - * notifier called to fold the percpu count back into the global counter sum.
> - * By including dying CPUs in the iteration mask, we avoid this race condition
> - * so __percpu_counter_sum() just does the right thing when CPUs are being taken
> - * offline.
> + * Note: This function is inherently racy against the lockless fastpath of
> + * percpu_counter_add_batch() unless externally serialized.
> */
> s64 __percpu_counter_sum(struct percpu_counter *fbc)
> {
> @@ -142,10 +137,8 @@ s64 __percpu_counter_sum(struct percpu_c
>
> raw_spin_lock_irqsave(&fbc->lock, flags);
> ret = fbc->count;
> - for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> - s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
> - ret += *pcount;
> - }
> + for_each_online_cpu(cpu)
> + ret += *per_cpu_ptr(fbc->counters, cpu);
> raw_spin_unlock_irqrestore(&fbc->lock, flags);
> return ret;
> }
> @@ -167,9 +160,9 @@ int __percpu_counter_init(struct percpu_
>
> #ifdef CONFIG_HOTPLUG_CPU
> INIT_LIST_HEAD(&fbc->list);
> - spin_lock_irqsave(&percpu_counters_lock, flags);
> + raw_spin_lock_irqsave(&percpu_counters_lock, flags);
> list_add(&fbc->list, &percpu_counters);
> - spin_unlock_irqrestore(&percpu_counters_lock, flags);
> + raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
> #endif
> return 0;
> }
> @@ -185,9 +178,9 @@ void percpu_counter_destroy(struct percp
> debug_percpu_counter_deactivate(fbc);
>
> #ifdef CONFIG_HOTPLUG_CPU
> - spin_lock_irqsave(&percpu_counters_lock, flags);
> + raw_spin_lock_irqsave(&percpu_counters_lock, flags);
> list_del(&fbc->list);
> - spin_unlock_irqrestore(&percpu_counters_lock, flags);
> + raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
> #endif
> free_percpu(fbc->counters);
> fbc->counters = NULL;
> @@ -197,22 +190,29 @@ EXPORT_SYMBOL(percpu_counter_destroy);
> int percpu_counter_batch __read_mostly = 32;
> EXPORT_SYMBOL(percpu_counter_batch);
>
> -static int compute_batch_value(unsigned int cpu)
> +static void compute_batch_value(int offs)
> {
> - int nr = num_online_cpus();
> + int nr = num_online_cpus() + offs;
> +
> + percpu_counter_batch = max(32, nr * 2);
> +}
>
> - percpu_counter_batch = max(32, nr*2);
> +static int percpu_counter_cpu_starting(unsigned int cpu)
> +{
> + /* If invoked during hotplug @cpu is not yet marked online. */
> + compute_batch_value(cpu_online(cpu) ? 0 : 1);
> return 0;
> }
>
> -static int percpu_counter_cpu_dead(unsigned int cpu)
> +static int percpu_counter_cpu_dying(unsigned int cpu)
> {
> #ifdef CONFIG_HOTPLUG_CPU
> struct percpu_counter *fbc;
> + unsigned long flags;
>
> - compute_batch_value(cpu);
> + compute_batch_value(0);
>
> - spin_lock_irq(&percpu_counters_lock);
> + raw_spin_lock_irqsave(&percpu_counters_lock, flags);
> list_for_each_entry(fbc, &percpu_counters, list) {
> s32 *pcount;
>
> @@ -222,7 +222,7 @@ static int percpu_counter_cpu_dead(unsig
> *pcount = 0;
> raw_spin_unlock(&fbc->lock);
> }
> - spin_unlock_irq(&percpu_counters_lock);
> + raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
> #endif
> return 0;
> }
> @@ -256,15 +256,8 @@ EXPORT_SYMBOL(__percpu_counter_compare);
>
> static int __init percpu_counter_startup(void)
> {
> - int ret;
> -
> - ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "lib/percpu_cnt:online",
> - compute_batch_value, NULL);
> - WARN_ON(ret < 0);
> - ret = cpuhp_setup_state_nocalls(CPUHP_PERCPU_CNT_DEAD,
> - "lib/percpu_cnt:dead", NULL,
> - percpu_counter_cpu_dead);
> - WARN_ON(ret < 0);
> + WARN_ON(cpuhp_setup_state(CPUHP_AP_PERCPU_COUNTER_STARTING, "lib/percpu_counter:starting",
> + percpu_counter_cpu_starting, percpu_counter_cpu_dying));
> return 0;
> }
> module_init(percpu_counter_startup);
>
Thanks for this work. This is a much more complete solution.

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis