From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id ED03DC25B78 for ; Tue, 28 May 2024 13:46:15 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 827E96B00A1; Tue, 28 May 2024 09:46:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 7D8536B00A2; Tue, 28 May 2024 09:46:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6793B6B00A3; Tue, 28 May 2024 09:46:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 48A526B00A1 for ; Tue, 28 May 2024 09:46:15 -0400 (EDT) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 13284C038E for ; Tue, 28 May 2024 13:46:15 +0000 (UTC) X-FDA: 82167928710.12.4F07AF9 Received: from mail-wm1-f52.google.com (mail-wm1-f52.google.com [209.85.128.52]) by imf24.hostedemail.com (Postfix) with ESMTP id 2C0DE180008 for ; Tue, 28 May 2024 13:46:12 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=FnJePuN6; spf=pass (imf24.hostedemail.com: domain of mjguzik@gmail.com designates 209.85.128.52 as permitted sender) smtp.mailfrom=mjguzik@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1716903973; a=rsa-sha256; cv=none; b=queDO1Rb8tdJIWorertC4gRXjvH1L/ERV05E6RfBirZvnV3jjvsVWxcPpeL1EmJai0Z6st qBkmPlSOa96wKHwnhk8ozWG6Gp/qx2RVOK7JuWEmfw5xtr/aw3ldVn0qGnl+tinKMsW13/ G/J9nVrFvMIehZLZq6S/b6iFHDR2rRM= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=FnJePuN6; spf=pass (imf24.hostedemail.com: domain of mjguzik@gmail.com designates 209.85.128.52 as permitted sender) smtp.mailfrom=mjguzik@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1716903973; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=30xqSKjOjXw7Y7njzyhoanyTf4RRJTROvAjzt4LX46I=; b=Qpgt8dltzlWvGbcqPYq/LU1HlVKM6ayuHkP8az17sF0jLmM+WvSGQ95e4xYeW5QrBYDrK/ ajSDcEtsO1jVdJpd+3X6fh11vCtccpHYCfyL8TjvAS6Nu2Z96MEwryG4U7yrUfyeet2WIn g0EYZrlwd/Jq3cw3BY8V35gcDYGdbww= Received: by mail-wm1-f52.google.com with SMTP id 5b1f17b1804b1-421140314d5so5862005e9.0 for ; Tue, 28 May 2024 06:46:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1716903972; x=1717508772; darn=kvack.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=30xqSKjOjXw7Y7njzyhoanyTf4RRJTROvAjzt4LX46I=; b=FnJePuN6Y+YUfdhGjwCUHE2zftn9eLF9I2uTfv99/HDdRxjFByS0HSI7RdyeTEMvxz AoiojsshwZ56P6GqNZhv5e5JOS3JigyVsvwRaQr8QtgjZloHlyMxKC7QHn/zhZmz+ZIg lruGTQJQQN9D5OBWyoCl97uEDkO6V9E1N9inUfA5zoZWoee4O0+6bdjfTmTAQo8EoyCu a3kjx7Zw6eRb27U8Dfd2vTuM7d+oc3L4/iv4ywd1eTsrqflyJs++DC/ksyDNfRJEgGOk zHmTk3tEDnul/ggcLJBwvAsecBDx2i+N0WmlAIfmHINHRQpTiYtOyp+R6vAAhsyslhRn CQPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1716903972; x=1717508772; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=30xqSKjOjXw7Y7njzyhoanyTf4RRJTROvAjzt4LX46I=; b=YUvAYPzKDybvjcLHWHnRsZriheqVbmprIZrAgQItX+oQa7lQFR4BzEkR4s7Ig9k09i rHPAYBe+6Q/fuQb60ASToLr9m6yt9Wgp7QS9lP7sGXYGWdneod1F49nPYgZcustF7Z1b aha4sBBCZ25/vETIqFgWzQuCauffLSyCkTqQBJQO3jwOIBvChoKWSMd57WylOi2KZNB9 M6oF93Zauo4GlH9uv2iKayJIZV1LyuQtqrn3ahGtJDKZhpTugmY/ALAZtedte7rB2RgX ioQe4lY2EERTFJgpgCChfT5LF5KaseOPfygKAvGI8PZWiOzgoNxpXfZKrcr253TayykL k3WQ== X-Forwarded-Encrypted: i=1; AJvYcCVlfap9Xd60DrtyoCIH1504n7scQ5R1BmjjJqbgM4CwK15lm6tHiI4AwpQc2RfOvlMWD+I7HWnkZfS4LH+RcHxgw/E= X-Gm-Message-State: AOJu0YxRKUznR8KdE6QfeBQzWk+NDZeaWy/t1DhPIa2fhbXt6GlO8zQ4 LjGuiA8N2ULcNx4hcRC6CuPZ5XQQfLBQ7LdV2KKHxgGqoL8EDc2D X-Google-Smtp-Source: AGHT+IEngAtN70Li5molD+j7snS98JGFYbHNecNTW6v/xiqDVGMoFnXuoLD/EeV3zEBe4zC4g9taDA== X-Received: by 2002:a1c:7902:0:b0:41b:c024:8e88 with SMTP id 5b1f17b1804b1-42108a0dc79mr77304435e9.33.1716903971518; Tue, 28 May 2024 06:46:11 -0700 (PDT) Received: from f (cst-prg-92-138.cust.vodafone.cz. [46.135.92.138]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-421089ae960sm142726565e9.31.2024.05.28.06.46.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 28 May 2024 06:46:09 -0700 (PDT) Date: Tue, 28 May 2024 15:44:44 +0200 From: Mateusz Guzik To: Hugh Dickins Cc: Andrew Morton , Tim Chen , Dave Chinner , "Darrick J. Wong" , Christian Brauner , Carlos Maiolino , Chuck Lever , Jan Kara , Matthew Wilcox , Johannes Weiner , Axel Rasmussen , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Message-ID: <5eemkb4lo5eefp7ijgncgogwmadyzmvjfjmmmvfiki6cwdskfs@hi2z4drqeuz6> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 2C0DE180008 X-Stat-Signature: qqxkumqggh78fat1utgdc7rijf38xfrb X-HE-Tag: 1716903972-626371 X-HE-Meta: U2FsdGVkX18nqHcj/cVnhgUhDvADf8MWLn504oMhII/Wtm+ymz5NZ9mTLpWZubzK4oSoMR9CetVh4iGQ8CmmVea1j6UgC9d04yxG/ovk8ofRdPX7r3LqQ36nMZz7zZaqWvVjTrmCFmLei5jc3EiZH1lohurMRxAmjicGaAIkvtMye4oaFvpl1+biLbPrx4kcEBoY026efcO9VPoVp0aSa2fIodx9x9zoo4xmFovNOm9ylZySNHNFxd8i1T7iFHjnp93X4mRuO4lEV+iP6+jQv6/WSzz6rl4mq1HBnpCUIzG31LSsmkI2a39VMdA3Uw7J2mo+25LMWZHHqascBp/G+z0wfr8cL+v9FjgZ5uNtIFkHzJ6QdHD1jlnZcpONUqNavIv9YevE4QZmVitRfppZqBTCytnUoHouDLRx/432Y/Qu2Jmf75DbIbjdPeVyMU1qrZJnElKSCx69rSTqp52/M+OihXQkE7j041V6nF/zezPaYDy/nHr+tJ3Zhbz0Si8KHWWdAzeI5z1CReTpD8GA/AsQG3xWc1/5BY410x+kd9zQRsuHmX5QIPiMoEhs+CZXngfzm0UFObkcJdmWbMCWJ+93FiAg3LIGCg7kup6voVLQRAeuWgniC04GzmnWPAqsKahADneidAeJ2jTtWS0Wfe4IDXIzXHORzxOdrGmBTYv4UqsCSHKMfPUoLGshexKXXVI1BmxTe0UhYVTPYXnFOx3WxBCXUb2Tqa0NfuNb7q+qgynTq+ZtwOSSgqmIBZplbOQ+extAwkbVd1f2fnFKSuF/RZugflnOM70LMdCA0QBKT0WORERWfe5caenYLruUuYf81L81YE92ZHaFXxRKgSMYeFvWCgha8W4fs9tmPnlgDTOkdst5pTZIM4TT6WnAbTWdf3hhphGSVAHfC4Gku9pXqEN4oARpZnc5cVgjPAlzlov/aV4CeNGD3WVYmw0meZ/ILiR2PERI6nT+Z84 IiH3dX5I XivZmZyCSN/h+4qsb82XbYrtSdRM8enVpjRmdohKH+bOuizjaH8TRu1pCUGjaTLo9SRCStIqfRyGMlQ/OAeyhox7McDCc7HlHfhwH8ARnDNsoB6iyd0bGka3xyiiqwgRrA6kjrg06WPAnDLz9iB+cBD+/tTZuKyFjYIabdHZtKjEng/8E166Bq69ydL+ihoCTiI98naIK3PrHqEi/wK5ybvnKfUbWx9aRPjZlWbdbgKMOwlCeaWZX0w+HDYImMkNy0WGtbiHnu+f3LS4AFL+qwHrjlDP8n+PlwpmTYpRnr08q7wCvJ9ZXcU5uQEVFtSXkafG0Ylc5r0RDLIWJfXKlhwd7huDkeCLV9a7nI0A+QtXxuc7IMthwTO7E+6+uqwJ5k6EQTUeuNTTu3zYFa8ljsPfiE6rIQWJoZCFjqVaZu6B3m58rpWYiNlK9W/6LQe7krm2ljJ+9UdiOlOCAhS8jolrPo0+LbaLFU6fs X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sat, May 25, 2024 at 08:00:15AM +0200, Mateusz Guzik wrote: > On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote: > > Percpu counter's compare and add are separate functions: without locking > > around them (which would defeat their purpose), it has been possible to > > overflow the intended limit. Imagine all the other CPUs fallocating > > tmpfs huge pages to the limit, in between this CPU's compare and its add. > > > > I have not seen reports of that happening; but tmpfs's recent addition > > of dquot_alloc_block_nodirty() in between the compare and the add makes > > it even more likely, and I'd be uncomfortable to leave it unfixed. > > > > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it. > > > > I think the posted (and by now landed) code is racy. > > I had seen there was a follow up patch which further augmented the > routine, but it did not alter the issue below so I'm responding to this > thread. > > > +/* > > + * Compare counter, and add amount if the total is within limit. > > + * Return true if amount was added, false if it would exceed limit. > > + */ > > +bool __percpu_counter_limited_add(struct percpu_counter *fbc, > > + s64 limit, s64 amount, s32 batch) > > +{ > > + s64 count; > > + s64 unknown; > > + unsigned long flags; > > + bool good; > > + > > + if (amount > limit) > > + return false; > > + > > + local_irq_save(flags); > > + unknown = batch * num_online_cpus(); > > + count = __this_cpu_read(*fbc->counters); > > + > > + /* Skip taking the lock when safe */ > > + if (abs(count + amount) <= batch && > > + fbc->count + unknown <= limit) { > > + this_cpu_add(*fbc->counters, amount); > > + local_irq_restore(flags); > > + return true; > > + } > > + > > Note the value of fbc->count is *not* stabilized. > > > + raw_spin_lock(&fbc->lock); > > + count = fbc->count + amount; > > + > > + /* Skip percpu_counter_sum() when safe */ > > + if (count + unknown > limit) { > > + s32 *pcount; > > + int cpu; > > + > > + for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) { > > + pcount = per_cpu_ptr(fbc->counters, cpu); > > + count += *pcount; > > + } > > + } > > + > > + good = count <= limit; > > + if (good) { > > + count = __this_cpu_read(*fbc->counters); > > + fbc->count += count + amount; > > + __this_cpu_sub(*fbc->counters, count); > > + } > > + > > + raw_spin_unlock(&fbc->lock); > > + local_irq_restore(flags); > > + return good; > > +} > > + > > Consider 2 cpus executing the func at the same time, where one is > updating fbc->counter in the slow path while the other is testing it in > the fast path. > > cpu0 cpu1 > if (abs(count + amount) <= batch && > fbc->count + unknown <= limit) > /* gets past the per-cpu traversal */ > /* at this point cpu0 decided to bump fbc->count, while cpu1 decided to > * bump the local counter */ > this_cpu_add(*fbc->counters, amount); > fbc->count += count + amount; > > Suppose fbc->count update puts it close enough to the limit that an > addition from cpu1 would put the entire thing over said limit. > > Since the 2 operations are not synchronized cpu1 can observe fbc->count > prior to the bump and update it's local counter, resulting in > aforementioned overflow. > > Am I misreading something here? Is this prevented? > > To my grep the only user is quotas in shmem and I wonder if that use is > even justified. I am aware of performance realities of atomics. However, > it very well may be that whatever codepaths which exercise the counter > are suffering multicore issues elsewhere, making an atomic (in a > dedicated cacheline) a perfectly sane choice for the foreseeable future. > Should this be true there would be 0 rush working out a fixed variant of > the routine. So I tried it out and I don't believe a per-cpu counter for mem usage of shmem is warranted at this time. In a setting where usage keeps changing there is a massive bottleneck around folio_lruvec_lock. The rare case where a bunch of memory is just populated one off on a tmpfs mount can probably suffer the atomic, it can only go so far up before you are out of memory. For the value to keep changing some of the pages will have to be released and this is what I'm testing below by slapping ftruncate on a test doing 1MB file writes in a loop (in will-it-scale): diff --git a/tests/write1.c b/tests/write1.c index d860904..790ddb3 100644 --- a/tests/write1.c +++ b/tests/write1.c @@ -28,6 +28,7 @@ void testcase(unsigned long long *iterations, unsigned long nr) size = 0; lseek(fd, 0, SEEK_SET); } + ftruncate(fd, 0); (*iterations)++; } That this is allocates pages for 1MB size and releases them continously. When mounting tmpfs on /tmp and benching with "./write1_processes -t 20" (20 workers) I see almost all of the time spent spinning in __pv_queued_spin_lock_slowpath. As such I don't believe the per-cpu split buys anything in terms of scalability and as I explained in the previous mail I think the routine used here is buggy, while shmem is the only user. Should this switch to a centralized atomic (either cmpxchg loop or xadd + backpedal) there should be no loss in scalability given the lruvec problem. Then the routine could be commented out or whacked for the time being. backtraces for interested: bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }' @[ __pv_queued_spin_lock_slowpath+5 _raw_spin_lock_irqsave+61 folio_lruvec_lock_irqsave+95 __page_cache_release+130 folios_put_refs+139 shmem_undo_range+702 shmem_setattr+983 notify_change+556 do_truncate+131 do_ftruncate+254 __x64_sys_ftruncate+62 do_syscall_64+87 entry_SYSCALL_64_after_hwframe+118 ]: 4750931 @[ __pv_queued_spin_lock_slowpath+5 _raw_spin_lock_irqsave+61 folio_lruvec_lock_irqsave+95 folio_batch_move_lru+139 lru_add_drain_cpu+141 __folio_batch_release+49 shmem_undo_range+702 shmem_setattr+983 notify_change+556 do_truncate+131 do_ftruncate+254 __x64_sys_ftruncate+62 do_syscall_64+87 entry_SYSCALL_64_after_hwframe+118 ]: 4761483