From: Dave Chinner <dchinner@redhat.com>
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Valentin Schneider, Dennis Zhou, Tejun Heo,
	Christoph Lameter, Yury Norov, Andy Shevchenko, Rasmus Villemoes,
	Ye Bin, linux-mm@kvack.org
Subject: Re: [patch 1/3] lib/percpu_counter: Fix CPU hotplug handling
Date: Mon, 17 Apr 2023 12:09:37 +1000
References: <20230414162755.281993820@linutronix.de>
 <20230414162841.166896739@linutronix.de>
In-Reply-To: <20230414162841.166896739@linutronix.de>

On Fri, Apr 14, 2023 at 06:30:43PM +0200, Thomas Gleixner wrote:
> Commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") tried to
> address a race condition between percpu_counter_sum() and a concurrent CPU
> hotplug operation.
>
> The race window is between the point where an unplugged CPU removed itself
> from the online_cpu_mask and the hotplug state callback which folds the per
> CPU counters of the now dead CPU into the global count.
>
> percpu_counter_sum() used for_each_online_cpu() to accumulate the per CPU
> local counts, so during the race window it failed to account for the not
> yet folded back local count of the offlined CPU.
>
> The attempt to address this used the admittedly undocumented and
> pointlessly public cpu_dying_mask by changing the loop iterator to take
> both the cpu_online_mask and the cpu_dying_mask into account.
>
> That works to some extent, but it is incorrect.
>
> The cpu_dying_mask bits are sticky even after cpu_up()/cpu_down()
> completes. That means that all offlined CPUs are always taken into
> account. In the case of disabling SMT at boot time or runtime this results
> in evaluating _all_ offlined SMT siblings' counters forever. Depending on
> system size, that's a massive amount of cache lines to be touched forever.
>
> It might be argued that the cpu_dying_mask bit could be cleared when
> cpu_down() completes, but that's not possible under all circumstances.
>
> Especially with partial hotplug the bit must be sticky in order to keep the
> initial user, i.e. the scheduler, correct. Partial hotplug, which allows
> explicit state transitions, also can create a situation where the race
> window gets recreated:
>
>   cpu_down(target = CPUHP_PERCPU_CNT_DEAD + 1)
>
> brings a CPU down to one state before the per CPU counter folding
> callback. As this did not reach CPUHP_OFFLINE state the bit would stay set.
> Now the next partial operation:
>
>   cpu_up(target = CPUHP_PERCPU_CNT_DEAD + 2)
>
> has to clear the bit and the race window is open again.
>
> There are two ways to solve this:
>
>  1) Maintain a local CPU mask in the per CPU counter code which
>     gets the bit set when a CPU comes online and removed in the
>     CPUHP_PERCPU_CNT_DEAD state after folding.
>
>     This adds more code and complexity.
>
>  2) Move the folding hotplug state into the DYING callback section, which
>     runs on the outgoing CPU immediately after it cleared its online bit.
>
>     There is no concurrency vs. percpu_counter_sum() on another CPU
>     because all still online CPUs are waiting in stop_machine() for the
>     outgoing CPU to complete its shutdown. The raw spinlock held around
>     the CPU mask iteration prevents an online CPU from reaching the stop
>     machine thread while iterating, which implicitly prevents the
>     outgoing CPU from clearing its online bit.
>
>     This is way simpler than #1 and makes the hotplug calls symmetric for
>     the price of a slightly longer wait time in stop_machine(), which is
>     not the end of the world as CPU unplug is already slow. The overall
>     time for a cpu_down() operation stays exactly the same.
>
> Implement #2 and plug the race completely.
>
> percpu_counter_sum() is still inherently racy against a concurrent
> percpu_counter_add_batch() fastpath unless externally serialized. That's
> completely independent of CPU hotplug though.
>
> Fixes: 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race")
> Signed-off-by: Thomas Gleixner
> Cc: Dennis Zhou
> Cc: Tejun Heo
> Cc: Christoph Lameter
> Cc: Dave Chinner
> Cc: Yury Norov
> Cc: Andy Shevchenko
> Cc: Rasmus Villemoes
> Cc: Ye Bin
> Cc: linux-mm@kvack.org
> ---
>  include/linux/cpuhotplug.h |    2 -
>  lib/percpu_counter.c       |   57 +++++++++++++++++++--------------------------
>  2 files changed, 26 insertions(+), 33 deletions(-)
>
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -91,7 +91,6 @@ enum cpuhp_state {
>  	CPUHP_PRINTK_DEAD,
>  	CPUHP_MM_MEMCQ_DEAD,
>  	CPUHP_XFS_DEAD,
> -	CPUHP_PERCPU_CNT_DEAD,
>  	CPUHP_RADIX_DEAD,
>  	CPUHP_PAGE_ALLOC,
>  	CPUHP_NET_DEV_DEAD,
> @@ -196,6 +195,7 @@ enum cpuhp_state {
>  	CPUHP_AP_SMPCFD_DYING,
>  	CPUHP_AP_X86_TBOOT_DYING,
>  	CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
> +	CPUHP_AP_PERCPU_COUNTER_STARTING,
>  	CPUHP_AP_ONLINE,
>  	CPUHP_TEARDOWN_CPU,
>
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -12,7 +12,7 @@
>
>  #ifdef CONFIG_HOTPLUG_CPU
>  static LIST_HEAD(percpu_counters);
> -static DEFINE_SPINLOCK(percpu_counters_lock);
> +static DEFINE_RAW_SPINLOCK(percpu_counters_lock);
>  #endif
>
>  #ifdef CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER
> @@ -126,13 +126,8 @@ EXPORT_SYMBOL(percpu_counter_sync);
>   * Add up all the per-cpu counts, return the result. This is a more accurate
>   * but much slower version of percpu_counter_read_positive().
>   *
> - * We use the cpu mask of (cpu_online_mask | cpu_dying_mask) to capture sums
> - * from CPUs that are in the process of being taken offline. Dying cpus have
> - * been removed from the online mask, but may not have had the hotplug dead
> - * notifier called to fold the percpu count back into the global counter sum.
> - * By including dying CPUs in the iteration mask, we avoid this race condition
> - * so __percpu_counter_sum() just does the right thing when CPUs are being taken
> - * offline.
> + * Note: This function is inherently racy against the lockless fastpath of
> + * percpu_counter_add_batch() unless externally serialized.
>   */
>  s64 __percpu_counter_sum(struct percpu_counter *fbc)
>  {
> @@ -142,10 +137,8 @@ s64 __percpu_counter_sum(struct percpu_c
>
>  	raw_spin_lock_irqsave(&fbc->lock, flags);
>  	ret = fbc->count;
> -	for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> -		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
> -		ret += *pcount;
> -	}
> +	for_each_online_cpu(cpu)
> +		ret += *per_cpu_ptr(fbc->counters, cpu);
>  	raw_spin_unlock_irqrestore(&fbc->lock, flags);
>  	return ret;
>  }
> @@ -167,9 +160,9 @@ int __percpu_counter_init(struct percpu_
>
>  #ifdef CONFIG_HOTPLUG_CPU
>  	INIT_LIST_HEAD(&fbc->list);
> -	spin_lock_irqsave(&percpu_counters_lock, flags);
> +	raw_spin_lock_irqsave(&percpu_counters_lock, flags);
>  	list_add(&fbc->list, &percpu_counters);
> -	spin_unlock_irqrestore(&percpu_counters_lock, flags);
> +	raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
>  #endif
>  	return 0;
>  }
> @@ -185,9 +178,9 @@ void percpu_counter_destroy(struct percp
>  	debug_percpu_counter_deactivate(fbc);
>
>  #ifdef CONFIG_HOTPLUG_CPU
> -	spin_lock_irqsave(&percpu_counters_lock, flags);
> +	raw_spin_lock_irqsave(&percpu_counters_lock, flags);
>  	list_del(&fbc->list);
> -	spin_unlock_irqrestore(&percpu_counters_lock, flags);
> +	raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
>  #endif
>  	free_percpu(fbc->counters);
>  	fbc->counters = NULL;
> @@ -197,22 +190,29 @@ EXPORT_SYMBOL(percpu_counter_destroy);
>  int percpu_counter_batch __read_mostly = 32;
>  EXPORT_SYMBOL(percpu_counter_batch);
>
> -static int compute_batch_value(unsigned int cpu)
> +static void compute_batch_value(int offs)
>  {
> -	int nr = num_online_cpus();
> +	int nr = num_online_cpus() + offs;
> +
> +	percpu_counter_batch = max(32, nr * 2);
> +}
>
> -	percpu_counter_batch = max(32, nr*2);
> +static int percpu_counter_cpu_starting(unsigned int cpu)
> +{
> +	/* If invoked during hotplug @cpu is not yet marked online. */
> +	compute_batch_value(cpu_online(cpu) ? 0 : 1);
>  	return 0;
>  }

So this changes the batch size based on whether the CPU is starting
or dying to try to get _compare() to fall into the slow path
correctly?

How is this supposed to work with counters that have caller supplied
custom batch sizes? i.e. use percpu_counter_add_batch() and
__percpu_counter_compare() with their own batch sizes directly?

Do they now need to add their own cpu hotplug hooks to screw around
with their batch sizes as well?

-Dave.

-- 
Dave Chinner
dchinner@redhat.com
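
For reference, the "folding" the changelog keeps referring to is the
pre-patch CPUHP_PERCPU_CNT_DEAD handler in lib/percpu_counter.c. A
sketch of that old handler follows, abridged and with comments added
(kernel context assumed; this is the handler being moved, not the new
DYING-section code from the truncated part of the patch):

/*
 * Pre-patch CPUHP_PERCPU_CNT_DEAD handler (abridged). The patch moves
 * this folding into the DYING callback section so it runs on the
 * outgoing CPU under stop_machine(), immediately after that CPU clears
 * its online bit -- closing the window in which percpu_counter_sum()
 * could miss the not-yet-folded per-CPU delta.
 */
static int percpu_counter_cpu_dead(unsigned int cpu)
{
	struct percpu_counter *fbc;

	spin_lock_irq(&percpu_counters_lock);
	list_for_each_entry(fbc, &percpu_counters, list) {
		s32 *pcount;

		raw_spin_lock(&fbc->lock);
		pcount = per_cpu_ptr(fbc->counters, cpu);
		fbc->count += *pcount;	/* fold the dead CPU's delta */
		*pcount = 0;
		raw_spin_unlock(&fbc->lock);
	}
	spin_unlock_irq(&percpu_counters_lock);
	return 0;
}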
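
The cutoff Dave is pointing at sits in __percpu_counter_compare(),
which this patch does not touch. Below is the upstream function with
comments added on the role of 'batch'; the comments (including the XFS
example, which passes its own batch sizes) are editorial, not part of
the patch under review:

/*
 * Upstream __percpu_counter_compare() from lib/percpu_counter.c.
 */
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
{
	s64	count;

	count = percpu_counter_read(fbc);
	/*
	 * Each CPU may park up to 'batch' of unfolded delta, so the fuzzy
	 * read may be off by at most batch * num_online_cpus(). Only a
	 * larger distance from rhs makes the fast path decisive. 'batch'
	 * is the caller's value here: callers such as XFS that supply
	 * their own batch see no effect from the hotplug code resizing
	 * the global percpu_counter_batch.
	 */
	if (abs(count - rhs) > (batch * num_online_cpus())) {
		if (count > rhs)
			return 1;
		else
			return -1;
	}
	/* Too close to call: fall back to the exact (slow) sum. */
	count = percpu_counter_sum(fbc);
	if (count > rhs)
		return 1;
	else if (count < rhs)
		return -1;
	else
		return 0;
}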