Date: Fri, 14 Apr 2023 22:20:04 -0700
From: Dennis Zhou
To: Thomas Gleixner
Cc: LKML, Peter Zijlstra, Valentin Schneider, Tejun Heo, Christoph Lameter,
	Dave Chinner, Yury Norov, Andy Shevchenko, Rasmus Villemoes, Ye Bin,
	linux-mm@kvack.org
Subject: Re: [patch 1/3] lib/percpu_counter: Fix CPU hotplug handling
References: <20230414162755.281993820@linutronix.de> <20230414162841.166896739@linutronix.de>
In-Reply-To: <20230414162841.166896739@linutronix.de>
Hello,

On Fri, Apr 14, 2023 at 06:30:43PM +0200, Thomas Gleixner wrote:
> Commit 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race") tried to
> address a race condition between percpu_counter_sum() and a concurrent
> CPU hotplug operation.
>
> The race window is between the point where an unplugged CPU removed
> itself from the online_cpu_mask and the hotplug state callback which
> folds the per-CPU counters of the now dead CPU into the global count.
>
> percpu_counter_sum() used for_each_online_cpu() to accumulate the
> per-CPU local counts, so during the race window it failed to account
> for the not yet folded back local count of the offlined CPU.
>
> The attempt to address this used the admittedly undocumented and
> pointlessly public cpu_dying_mask by changing the loop iterator to take
> both the cpu_online_mask and the cpu_dying_mask into account.
>
> That works to some extent, but it is incorrect.
>
> The cpu_dying_mask bits are sticky even after cpu_up()/cpu_down()
> completes. That means that all offlined CPUs are always taken into
> account. In the case of disabling SMT at boot time or runtime this
> results in evaluating the counters of _all_ offlined SMT siblings
> forever. Depending on system size, that's a massive number of cache
> lines to be touched forever.
>
> It might be argued that the cpu_dying_mask bit could be cleared when
> cpu_down() completes, but that's not possible under all circumstances.
>
> Especially with partial hotplug the bit must be sticky in order to keep
> the initial user, i.e. the scheduler, correct. Partial hotplug, which
> allows explicit state transitions, can also create a situation where
> the race window gets recreated:
>
>   cpu_down(target = CPUHP_PERCPU_CNT_DEAD + 1)
>
> brings a CPU down to one state before the per-CPU counter folding
> callback. As this did not reach CPUHP_OFFLINE state, the bit would stay
> set. Now the next partial operation:
>
>   cpu_up(target = CPUHP_PERCPU_CNT_DEAD + 2)
>
> has to clear the bit and the race window is open again.
>
> There are two ways to solve this:
>
>  1) Maintain a local CPU mask in the per-CPU counter code which gets
>     the bit set when a CPU comes online and removed in the
>     CPUHP_PERCPU_CNT_DEAD state after folding.
>
>     This adds more code and complexity.
>
>  2) Move the folding hotplug state into the DYING callback section,
>     which runs on the outgoing CPU immediately after it cleared its
>     online bit.
>
>     There is no concurrency vs. percpu_counter_sum() on another CPU
>     because all still online CPUs are waiting in stop_machine() for the
>     outgoing CPU to complete its shutdown. The raw spinlock held around
>     the CPU mask iteration prevents an online CPU from reaching the
>     stop machine thread while iterating, which implicitly prevents the
>     outgoing CPU from clearing its online bit.
>
> This is way simpler than #1 and makes the hotplug calls symmetric for
> the price of a slightly longer wait time in stop_machine(), which is
> not the end of the world as CPU unplug is already slow. The overall
> time for a cpu_down() operation stays exactly the same.
>
> Implement #2 and plug the race completely.
>
> percpu_counter_sum() is still inherently racy against a concurrent
> percpu_counter_add_batch() fastpath unless externally serialized.
> That's completely independent of CPU hotplug though.
>
> Fixes: 8b57b11cca88 ("pcpcntrs: fix dying cpu summation race")
> Signed-off-by: Thomas Gleixner
> Cc: Dennis Zhou
> Cc: Tejun Heo
> Cc: Christoph Lameter
> Cc: Dave Chinner
> Cc: Yury Norov
> Cc: Andy Shevchenko
> Cc: Rasmus Villemoes
> Cc: Ye Bin
> Cc: linux-mm@kvack.org
> ---
>  include/linux/cpuhotplug.h |    2 -
>  lib/percpu_counter.c       |   57 +++++++++++++++++++--------------------------
>  2 files changed, 26 insertions(+), 33 deletions(-)
>
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -91,7 +91,6 @@ enum cpuhp_state {
>  	CPUHP_PRINTK_DEAD,
>  	CPUHP_MM_MEMCQ_DEAD,
>  	CPUHP_XFS_DEAD,
> -	CPUHP_PERCPU_CNT_DEAD,
>  	CPUHP_RADIX_DEAD,
>  	CPUHP_PAGE_ALLOC,
>  	CPUHP_NET_DEV_DEAD,
> @@ -196,6 +195,7 @@ enum cpuhp_state {
>  	CPUHP_AP_SMPCFD_DYING,
>  	CPUHP_AP_X86_TBOOT_DYING,
>  	CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
> +	CPUHP_AP_PERCPU_COUNTER_STARTING,
>  	CPUHP_AP_ONLINE,
>  	CPUHP_TEARDOWN_CPU,
>
> --- a/lib/percpu_counter.c
> +++ b/lib/percpu_counter.c
> @@ -12,7 +12,7 @@
>
>  #ifdef CONFIG_HOTPLUG_CPU
>  static LIST_HEAD(percpu_counters);
> -static DEFINE_SPINLOCK(percpu_counters_lock);
> +static DEFINE_RAW_SPINLOCK(percpu_counters_lock);
>  #endif
>
>  #ifdef CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER
> @@ -126,13 +126,8 @@ EXPORT_SYMBOL(percpu_counter_sync);
>   * Add up all the per-cpu counts, return the result. This is a more accurate
>   * but much slower version of percpu_counter_read_positive().
>   *
> - * We use the cpu mask of (cpu_online_mask | cpu_dying_mask) to capture sums
> - * from CPUs that are in the process of being taken offline. Dying cpus have
> - * been removed from the online mask, but may not have had the hotplug dead
> - * notifier called to fold the percpu count back into the global counter sum.
> - * By including dying CPUs in the iteration mask, we avoid this race condition
> - * so __percpu_counter_sum() just does the right thing when CPUs are being taken
> - * offline.
> + * Note: This function is inherently racy against the lockless fastpath of
> + * percpu_counter_add_batch() unless externally serialized.
>   */
>  s64 __percpu_counter_sum(struct percpu_counter *fbc)
>  {
> @@ -142,10 +137,8 @@ s64 __percpu_counter_sum(struct percpu_c
>
>  	raw_spin_lock_irqsave(&fbc->lock, flags);
>  	ret = fbc->count;
> -	for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> -		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
> -		ret += *pcount;
> -	}
> +	for_each_online_cpu(cpu)
> +		ret += *per_cpu_ptr(fbc->counters, cpu);
>  	raw_spin_unlock_irqrestore(&fbc->lock, flags);
>  	return ret;
>  }
> @@ -167,9 +160,9 @@ int __percpu_counter_init(struct percpu_
>
>  #ifdef CONFIG_HOTPLUG_CPU
>  	INIT_LIST_HEAD(&fbc->list);
> -	spin_lock_irqsave(&percpu_counters_lock, flags);
> +	raw_spin_lock_irqsave(&percpu_counters_lock, flags);
>  	list_add(&fbc->list, &percpu_counters);
> -	spin_unlock_irqrestore(&percpu_counters_lock, flags);
> +	raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
>  #endif
>  	return 0;
>  }
> @@ -185,9 +178,9 @@ void percpu_counter_destroy(struct percp
>  	debug_percpu_counter_deactivate(fbc);
>
>  #ifdef CONFIG_HOTPLUG_CPU
> -	spin_lock_irqsave(&percpu_counters_lock, flags);
> +	raw_spin_lock_irqsave(&percpu_counters_lock, flags);
>  	list_del(&fbc->list);
> -	spin_unlock_irqrestore(&percpu_counters_lock, flags);
> +	raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
>  #endif
>  	free_percpu(fbc->counters);
>  	fbc->counters = NULL;
> @@ -197,22 +190,29 @@ EXPORT_SYMBOL(percpu_counter_destroy);
>  int percpu_counter_batch __read_mostly = 32;
>  EXPORT_SYMBOL(percpu_counter_batch);
>
> -static int compute_batch_value(unsigned int cpu)
> +static void compute_batch_value(int offs)
>  {
> -	int nr = num_online_cpus();
> +	int nr = num_online_cpus() + offs;
> +
> +	percpu_counter_batch = max(32, nr * 2);
> +}
>
> -	percpu_counter_batch = max(32, nr*2);
> +static int percpu_counter_cpu_starting(unsigned int cpu)
> +{
> +	/* If invoked during hotplug @cpu is not yet marked online. */
> +	compute_batch_value(cpu_online(cpu) ? 0 : 1);
>  	return 0;
>  }
>
> -static int percpu_counter_cpu_dead(unsigned int cpu)
> +static int percpu_counter_cpu_dying(unsigned int cpu)
> {
>  #ifdef CONFIG_HOTPLUG_CPU
>  	struct percpu_counter *fbc;
> +	unsigned long flags;
>
> -	compute_batch_value(cpu);
> +	compute_batch_value(0);
>
> -	spin_lock_irq(&percpu_counters_lock);
> +	raw_spin_lock_irqsave(&percpu_counters_lock, flags);
>  	list_for_each_entry(fbc, &percpu_counters, list) {
>  		s32 *pcount;
>
> @@ -222,7 +222,7 @@ static int percpu_counter_cpu_dead(unsig
>  		*pcount = 0;
>  		raw_spin_unlock(&fbc->lock);
>  	}
> -	spin_unlock_irq(&percpu_counters_lock);
> +	raw_spin_unlock_irqrestore(&percpu_counters_lock, flags);
>  #endif
>  	return 0;
>  }
> @@ -256,15 +256,8 @@ EXPORT_SYMBOL(__percpu_counter_compare);
>
>  static int __init percpu_counter_startup(void)
>  {
> -	int ret;
> -
> -	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "lib/percpu_cnt:online",
> -				compute_batch_value, NULL);
> -	WARN_ON(ret < 0);
> -	ret = cpuhp_setup_state_nocalls(CPUHP_PERCPU_CNT_DEAD,
> -					"lib/percpu_cnt:dead", NULL,
> -					percpu_counter_cpu_dead);
> -	WARN_ON(ret < 0);
> +	WARN_ON(cpuhp_setup_state(CPUHP_AP_PERCPU_COUNTER_STARTING,
> +				  "lib/percpu_counter:starting",
> +				  percpu_counter_cpu_starting,
> +				  percpu_counter_cpu_dying));
>  	return 0;
>  }
>  module_init(percpu_counter_startup);
>

Thanks for this work. This is a much more complete solution.

Acked-by: Dennis Zhou

Thanks,
Dennis