From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id ED03DC25B78
	for <linux-mm@archiver.kernel.org>; Tue, 28 May 2024 13:46:15 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 827E96B00A1; Tue, 28 May 2024 09:46:15 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 7D8536B00A2; Tue, 28 May 2024 09:46:15 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 6793B6B00A3; Tue, 28 May 2024 09:46:15 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id 48A526B00A1
	for <linux-mm@kvack.org>; Tue, 28 May 2024 09:46:15 -0400 (EDT)
Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay10.hostedemail.com (Postfix) with ESMTP id 13284C038E
	for <linux-mm@kvack.org>; Tue, 28 May 2024 13:46:15 +0000 (UTC)
X-FDA: 82167928710.12.4F07AF9
Received: from mail-wm1-f52.google.com (mail-wm1-f52.google.com [209.85.128.52])
	by imf24.hostedemail.com (Postfix) with ESMTP id 2C0DE180008
	for <linux-mm@kvack.org>; Tue, 28 May 2024 13:46:12 +0000 (UTC)
Authentication-Results: imf24.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=FnJePuN6;
	spf=pass (imf24.hostedemail.com: domain of mjguzik@gmail.com designates 209.85.128.52 as permitted sender) smtp.mailfrom=mjguzik@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1716903973; a=rsa-sha256;
	cv=none;
	b=queDO1Rb8tdJIWorertC4gRXjvH1L/ERV05E6RfBirZvnV3jjvsVWxcPpeL1EmJai0Z6st
	qBkmPlSOa96wKHwnhk8ozWG6Gp/qx2RVOK7JuWEmfw5xtr/aw3ldVn0qGnl+tinKMsW13/
	G/J9nVrFvMIehZLZq6S/b6iFHDR2rRM=
ARC-Authentication-Results: i=1;
	imf24.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=FnJePuN6;
	spf=pass (imf24.hostedemail.com: domain of mjguzik@gmail.com designates 209.85.128.52 as permitted sender) smtp.mailfrom=mjguzik@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1716903973;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=30xqSKjOjXw7Y7njzyhoanyTf4RRJTROvAjzt4LX46I=;
	b=Qpgt8dltzlWvGbcqPYq/LU1HlVKM6ayuHkP8az17sF0jLmM+WvSGQ95e4xYeW5QrBYDrK/
	ajSDcEtsO1jVdJpd+3X6fh11vCtccpHYCfyL8TjvAS6Nu2Z96MEwryG4U7yrUfyeet2WIn
	g0EYZrlwd/Jq3cw3BY8V35gcDYGdbww=
Received: by mail-wm1-f52.google.com with SMTP id 5b1f17b1804b1-421140314d5so5862005e9.0
        for <linux-mm@kvack.org>; Tue, 28 May 2024 06:46:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1716903972; x=1717508772; darn=kvack.org;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to;
        bh=30xqSKjOjXw7Y7njzyhoanyTf4RRJTROvAjzt4LX46I=;
        b=FnJePuN6Y+YUfdhGjwCUHE2zftn9eLF9I2uTfv99/HDdRxjFByS0HSI7RdyeTEMvxz
         AoiojsshwZ56P6GqNZhv5e5JOS3JigyVsvwRaQr8QtgjZloHlyMxKC7QHn/zhZmz+ZIg
         lruGTQJQQN9D5OBWyoCl97uEDkO6V9E1N9inUfA5zoZWoee4O0+6bdjfTmTAQo8EoyCu
         a3kjx7Zw6eRb27U8Dfd2vTuM7d+oc3L4/iv4ywd1eTsrqflyJs++DC/ksyDNfRJEgGOk
         zHmTk3tEDnul/ggcLJBwvAsecBDx2i+N0WmlAIfmHINHRQpTiYtOyp+R6vAAhsyslhRn
         CQPg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1716903972; x=1717508772;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date
         :message-id:reply-to;
        bh=30xqSKjOjXw7Y7njzyhoanyTf4RRJTROvAjzt4LX46I=;
        b=YUvAYPzKDybvjcLHWHnRsZriheqVbmprIZrAgQItX+oQa7lQFR4BzEkR4s7Ig9k09i
         rHPAYBe+6Q/fuQb60ASToLr9m6yt9Wgp7QS9lP7sGXYGWdneod1F49nPYgZcustF7Z1b
         aha4sBBCZ25/vETIqFgWzQuCauffLSyCkTqQBJQO3jwOIBvChoKWSMd57WylOi2KZNB9
         M6oF93Zauo4GlH9uv2iKayJIZV1LyuQtqrn3ahGtJDKZhpTugmY/ALAZtedte7rB2RgX
         ioQe4lY2EERTFJgpgCChfT5LF5KaseOPfygKAvGI8PZWiOzgoNxpXfZKrcr253TayykL
         k3WQ==
X-Forwarded-Encrypted: i=1; AJvYcCVlfap9Xd60DrtyoCIH1504n7scQ5R1BmjjJqbgM4CwK15lm6tHiI4AwpQc2RfOvlMWD+I7HWnkZfS4LH+RcHxgw/E=
X-Gm-Message-State: AOJu0YxRKUznR8KdE6QfeBQzWk+NDZeaWy/t1DhPIa2fhbXt6GlO8zQ4
	LjGuiA8N2ULcNx4hcRC6CuPZ5XQQfLBQ7LdV2KKHxgGqoL8EDc2D
X-Google-Smtp-Source: AGHT+IEngAtN70Li5molD+j7snS98JGFYbHNecNTW6v/xiqDVGMoFnXuoLD/EeV3zEBe4zC4g9taDA==
X-Received: by 2002:a1c:7902:0:b0:41b:c024:8e88 with SMTP id 5b1f17b1804b1-42108a0dc79mr77304435e9.33.1716903971518;
        Tue, 28 May 2024 06:46:11 -0700 (PDT)
Received: from f (cst-prg-92-138.cust.vodafone.cz. [46.135.92.138])
        by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-421089ae960sm142726565e9.31.2024.05.28.06.46.06
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 28 May 2024 06:46:09 -0700 (PDT)
Date: Tue, 28 May 2024 15:44:44 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, 
	Tim Chen <tim.c.chen@intel.com>, Dave Chinner <dchinner@redhat.com>, 
	"Darrick J. Wong" <djwong@kernel.org>, Christian Brauner <brauner@kernel.org>, 
	Carlos Maiolino <cem@kernel.org>, Chuck Lever <chuck.lever@oracle.com>, Jan Kara <jack@suse.cz>, 
	Matthew Wilcox <willy@infradead.org>, Johannes Weiner <hannes@cmpxchg.org>, 
	Axel Rasmussen <axelrasmussen@google.com>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, 
	linux-mm@kvack.org
Subject: Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit,
 amount)
Message-ID: <5eemkb4lo5eefp7ijgncgogwmadyzmvjfjmmmvfiki6cwdskfs@hi2z4drqeuz6>
References: <c7441dc6-f3bb-dd60-c670-9f5cbd9f266@google.com>
 <bb817848-2d19-bcc8-39ca-ea179af0f0b4@google.com>
 <jh3yqdz43c24ur7w2jjutyvwodsdccefo6ycmtmjyvh25hojn4@aysycyla6pom>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <jh3yqdz43c24ur7w2jjutyvwodsdccefo6ycmtmjyvh25hojn4@aysycyla6pom>
X-Rspam-User: 
X-Rspamd-Server: rspam06
X-Rspamd-Queue-Id: 2C0DE180008
X-Stat-Signature: qqxkumqggh78fat1utgdc7rijf38xfrb
X-HE-Tag: 1716903972-626371
X-HE-Meta: U2FsdGVkX18nqHcj/cVnhgUhDvADf8MWLn504oMhII/Wtm+ymz5NZ9mTLpWZubzK4oSoMR9CetVh4iGQ8CmmVea1j6UgC9d04yxG/ovk8ofRdPX7r3LqQ36nMZz7zZaqWvVjTrmCFmLei5jc3EiZH1lohurMRxAmjicGaAIkvtMye4oaFvpl1+biLbPrx4kcEBoY026efcO9VPoVp0aSa2fIodx9x9zoo4xmFovNOm9ylZySNHNFxd8i1T7iFHjnp93X4mRuO4lEV+iP6+jQv6/WSzz6rl4mq1HBnpCUIzG31LSsmkI2a39VMdA3Uw7J2mo+25LMWZHHqascBp/G+z0wfr8cL+v9FjgZ5uNtIFkHzJ6QdHD1jlnZcpONUqNavIv9YevE4QZmVitRfppZqBTCytnUoHouDLRx/432Y/Qu2Jmf75DbIbjdPeVyMU1qrZJnElKSCx69rSTqp52/M+OihXQkE7j041V6nF/zezPaYDy/nHr+tJ3Zhbz0Si8KHWWdAzeI5z1CReTpD8GA/AsQG3xWc1/5BY410x+kd9zQRsuHmX5QIPiMoEhs+CZXngfzm0UFObkcJdmWbMCWJ+93FiAg3LIGCg7kup6voVLQRAeuWgniC04GzmnWPAqsKahADneidAeJ2jTtWS0Wfe4IDXIzXHORzxOdrGmBTYv4UqsCSHKMfPUoLGshexKXXVI1BmxTe0UhYVTPYXnFOx3WxBCXUb2Tqa0NfuNb7q+qgynTq+ZtwOSSgqmIBZplbOQ+extAwkbVd1f2fnFKSuF/RZugflnOM70LMdCA0QBKT0WORERWfe5caenYLruUuYf81L81YE92ZHaFXxRKgSMYeFvWCgha8W4fs9tmPnlgDTOkdst5pTZIM4TT6WnAbTWdf3hhphGSVAHfC4Gku9pXqEN4oARpZnc5cVgjPAlzlov/aV4CeNGD3WVYmw0meZ/ILiR2PERI6nT+Z84
 IiH3dX5I
 XivZmZyCSN/h+4qsb82XbYrtSdRM8enVpjRmdohKH+bOuizjaH8TRu1pCUGjaTLo9SRCStIqfRyGMlQ/OAeyhox7McDCc7HlHfhwH8ARnDNsoB6iyd0bGka3xyiiqwgRrA6kjrg06WPAnDLz9iB+cBD+/tTZuKyFjYIabdHZtKjEng/8E166Bq69ydL+ihoCTiI98naIK3PrHqEi/wK5ybvnKfUbWx9aRPjZlWbdbgKMOwlCeaWZX0w+HDYImMkNy0WGtbiHnu+f3LS4AFL+qwHrjlDP8n+PlwpmTYpRnr08q7wCvJ9ZXcU5uQEVFtSXkafG0Ylc5r0RDLIWJfXKlhwd7huDkeCLV9a7nI0A+QtXxuc7IMthwTO7E+6+uqwJ5k6EQTUeuNTTu3zYFa8ljsPfiE6rIQWJoZCFjqVaZu6B3m58rpWYiNlK9W/6LQe7krm2ljJ+9UdiOlOCAhS8jolrPo0+LbaLFU6fs
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Sat, May 25, 2024 at 08:00:15AM +0200, Mateusz Guzik wrote:
> On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> > Percpu counter's compare and add are separate functions: without locking
> > around them (which would defeat their purpose), it has been possible to
> > overflow the intended limit.  Imagine all the other CPUs fallocating
> > tmpfs huge pages to the limit, in between this CPU's compare and its add.
> > 
> > I have not seen reports of that happening; but tmpfs's recent addition
> > of dquot_alloc_block_nodirty() in between the compare and the add makes
> > it even more likely, and I'd be uncomfortable to leave it unfixed.
> > 
> > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> > 
> 
> I think the posted (and by now landed) code is racy.
> 
> I had seen there was a follow up patch which further augmented the
> routine, but it did not alter the issue below so I'm responding to this
> thread.
> 
> > +/*
> > + * Compare counter, and add amount if the total is within limit.
> > + * Return true if amount was added, false if it would exceed limit.
> > + */
> > +bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> > +				  s64 limit, s64 amount, s32 batch)
> > +{
> > +	s64 count;
> > +	s64 unknown;
> > +	unsigned long flags;
> > +	bool good;
> > +
> > +	if (amount > limit)
> > +		return false;
> > +
> > +	local_irq_save(flags);
> > +	unknown = batch * num_online_cpus();
> > +	count = __this_cpu_read(*fbc->counters);
> > +
> > +	/* Skip taking the lock when safe */
> > +	if (abs(count + amount) <= batch &&
> > +	    fbc->count + unknown <= limit) {
> > +		this_cpu_add(*fbc->counters, amount);
> > +		local_irq_restore(flags);
> > +		return true;
> > +	}
> > +
> 
> Note the value of fbc->count is *not* stabilized.
> 
> > +	raw_spin_lock(&fbc->lock);
> > +	count = fbc->count + amount;
> > +
> > +	/* Skip percpu_counter_sum() when safe */
> > +	if (count + unknown > limit) {
> > +		s32 *pcount;
> > +		int cpu;
> > +
> > +		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> > +			pcount = per_cpu_ptr(fbc->counters, cpu);
> > +			count += *pcount;
> > +		}
> > +	}
> > +
> > +	good = count <= limit;
> > +	if (good) {
> > +		count = __this_cpu_read(*fbc->counters);
> > +		fbc->count += count + amount;
> > +		__this_cpu_sub(*fbc->counters, count);
> > +	}
> > +
> > +	raw_spin_unlock(&fbc->lock);
> > +	local_irq_restore(flags);
> > +	return good;
> > +}
> > +
> 
> Consider 2 cpus executing the func at the same time, where one is
> updating fbc->counter in the slow path while the other is testing it in
> the fast path.
> 
> cpu0						cpu1
> 						if (abs(count + amount) <= batch &&				
> 						    fbc->count + unknown <= limit)
> /* gets past the per-cpu traversal */
> /* at this point cpu0 decided to bump fbc->count, while cpu1 decided to
>  * bump the local counter */
> 							this_cpu_add(*fbc->counters, amount);
> fbc->count += count + amount;
> 
> Suppose fbc->count update puts it close enough to the limit that an
> addition from cpu1 would put the entire thing over said limit.
> 
> Since the 2 operations are not synchronized cpu1 can observe fbc->count
> prior to the bump and update it's local counter, resulting in
> aforementioned overflow.
> 
> Am I misreading something here? Is this prevented?
> 
> To my grep the only user is quotas in shmem and I wonder if that use is
> even justified. I am aware of performance realities of atomics. However,
> it very well may be that whatever codepaths which exercise the counter
> are suffering multicore issues elsewhere, making an atomic (in a
> dedicated cacheline) a perfectly sane choice for the foreseeable future.
> Should this be true there would be 0 rush working out a fixed variant of
> the routine.

So I tried it out and I don't believe a per-cpu counter for mem usage of
shmem is warranted at this time. In a setting where usage keeps changing
there is a massive bottleneck around folio_lruvec_lock.

The rare case where a bunch of memory is just populated one off on a
tmpfs mount can probably suffer the atomic, it can only go so far up
before you are out of memory.

For the value to keep changing some of the pages will have to be
released and this is what I'm testing below by slapping ftruncate on a
test doing 1MB file writes in a loop (in will-it-scale):

diff --git a/tests/write1.c b/tests/write1.c
index d860904..790ddb3 100644
--- a/tests/write1.c
+++ b/tests/write1.c
@@ -28,6 +28,7 @@ void testcase(unsigned long long *iterations, unsigned long nr)
                        size = 0;
                        lseek(fd, 0, SEEK_SET);
                }
+               ftruncate(fd, 0);

                (*iterations)++;
        }

That this is allocates pages for 1MB size and releases them continously.

When mounting tmpfs on /tmp and benching with "./write1_processes -t 20"
(20 workers) I see almost all of the time spent spinning in
__pv_queued_spin_lock_slowpath.

As such I don't believe the per-cpu split buys anything in terms of
scalability and as I explained in the previous mail I think the routine
used here is buggy, while shmem is the only user. Should this switch to
a centralized atomic (either cmpxchg loop or xadd + backpedal) there
should be no loss in scalability given the lruvec problem. Then the
routine could be commented out or whacked for the time being.

backtraces for interested:

bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'

@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irqsave+61
    folio_lruvec_lock_irqsave+95
    __page_cache_release+130
    folios_put_refs+139
    shmem_undo_range+702
    shmem_setattr+983
    notify_change+556
    do_truncate+131
    do_ftruncate+254
    __x64_sys_ftruncate+62
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+118
]: 4750931
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irqsave+61
    folio_lruvec_lock_irqsave+95
    folio_batch_move_lru+139
    lru_add_drain_cpu+141
    __folio_batch_release+49
    shmem_undo_range+702
    shmem_setattr+983
    notify_change+556
    do_truncate+131
    do_ftruncate+254
    __x64_sys_ftruncate+62
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+118
]: 4761483